Entity Extraction from Portuguese Legal Documents Using Distant Supervision

Lucas M. Navarezi; Kenzo Sakiyama; Lucas S. Rodrigues; Caio M.O. Robaldo; Gustavo R. Lobato; Paulo A. Vilela; Edson T. Matsubara; Eraldo R. Fernandes

doi:10.1007/978-3-030-98305-5_16

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Standard

Entity Extraction from Portuguese Legal Documents Using Distant Supervision. / Navarezi, Lucas M.; Sakiyama, Kenzo; Rodrigues, Lucas S. et al.
Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. ed. / Vládia Pinheiro; Pablo Gamallo; Raquel Amaro; Carolina Scarton; Fernando Batista; Diego Silva; Catarina Magro; Hugo Pinto. Cham: Springer Nature Switzerland AG, 2022. p. 166-176 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13208 LNAI).

Harvard

Navarezi, LM, Sakiyama, K, Rodrigues, LS, Robaldo, CMO, Lobato, GR, Vilela, PA, Matsubara, ET & Fernandes, ER 2022, Entity Extraction from Portuguese Legal Documents Using Distant Supervision. in V Pinheiro, P Gamallo, R Amaro, C Scarton, F Batista, D Silva, C Magro & H Pinto (eds), Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13208 LNAI, Springer Nature Switzerland AG, Cham, pp. 166-176, 15th International Conference on the Computational Processing of Portuguese - PROPOR 2022, Fortaleza, Brazil, 21.03.22. https://doi.org/10.1007/978-3-030-98305-5_16

APA

Navarezi, L. M., Sakiyama, K., Rodrigues, L. S., Robaldo, C. M. O., Lobato, G. R., Vilela, P. A., Matsubara, E. T., & Fernandes, E. R. (2022). Entity Extraction from Portuguese Legal Documents Using Distant Supervision. In V. Pinheiro, P. Gamallo, R. Amaro, C. Scarton, F. Batista, D. Silva, C. Magro, & H. Pinto (Eds.), Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings (pp. 166-176). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13208 LNAI). Springer Nature Switzerland AG. https://doi.org/10.1007/978-3-030-98305-5_16

Vancouver

Navarezi LM, Sakiyama K, Rodrigues LS, Robaldo CMO, Lobato GR, Vilela PA et al. Entity Extraction from Portuguese Legal Documents Using Distant Supervision. In Pinheiro V, Gamallo P, Amaro R, Scarton C, Batista F, Silva D, Magro C, Pinto H, editors, Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. Cham: Springer Nature Switzerland AG. 2022. p. 166-176. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-98305-5_16

Bibtex

@inbook{901a8941060b444d887907898cf70daa,

title = "Entity Extraction from Portuguese Legal Documents Using Distant Supervision",

abstract = "Most approaches to role-filler entity extraction (REE) rely on large labeled training corpora in which entity mentions are directly annotated in the input document. In this work, we leverage an existing knowledge base (KB) of entities to perform document-level REE from drug seizure petitions. We propose a system that learns to extract entities from petitions to fill 29 roles of a drug seizure event. Although we have access to a KB covering more than 170 thousand entities and six thousand petitions, such that each entity in the KB is linked to a specific petition, the mentions to an entity within a petition{\textquoteright}s text are not annotated. The lack of these annotations brings challenges related to mismatches between entity values in the KB and entity mentions in the documents. Additionally, there are entities with same type or same value. Thus, we propose a distant annotation method to overcome these challenges and automatically label petition documents using the available KB. This annotation method includes a parameter that controls the balance between precision and recall. We also propose a strategy to effectively tune this parameter in order to optimize a given metric. We then train a BERT-based sequence labeling model that learns to identify entity mentions and label them. Our system achieves an F1 score of 78.59 with precision over 82%. We also report ablation studies regarding the distant annotation method.",

keywords = "BERT, Entity extraction, NLP, Business informatics, Informatics",

author = "Navarezi, {Lucas M.} and Kenzo Sakiyama and Rodrigues, {Lucas S.} and Robaldo, {Caio M.O.} and Lobato, {Gustavo R.} and Vilela, {Paulo A.} and Matsubara, {Edson T.} and Fernandes, {Eraldo R.}",

year = "2022",

doi = "10.1007/978-3-030-98305-5_16",

language = "English",

isbn = "978-3-030-98304-8",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Nature Switzerland AG",

pages = "166--176",

editor = "Vl{\'a}dia Pinheiro and Pablo Gamallo and Raquel Amaro and Carolina Scarton and Fernando Batista and Diego Silva and Catarina Magro and Hugo Pinto",

booktitle = "Computational Processing of the Portuguese Language",

address = "Switzerland",

note = "15th International Conference on the Computational Processing of Portuguese - PROPOR 2022, PROPOR 2022 ; Conference date: 21-03-2022 Through 23-03-2022",

url = "https://www.aclweb.org/portal/content/propor-2022-15th-international-conference-computational-processing-portuguese",

}

RIS

TY - CHAP

T1 - Entity Extraction from Portuguese Legal Documents Using Distant Supervision

AU - Navarezi, Lucas M.

AU - Sakiyama, Kenzo

AU - Rodrigues, Lucas S.

AU - Robaldo, Caio M.O.

AU - Lobato, Gustavo R.

AU - Vilela, Paulo A.

AU - Matsubara, Edson T.

AU - Fernandes, Eraldo R.

PY - 2022

Y1 - 2022

N2 - Most approaches to role-filler entity extraction (REE) rely on large labeled training corpora in which entity mentions are directly annotated in the input document. In this work, we leverage an existing knowledge base (KB) of entities to perform document-level REE from drug seizure petitions. We propose a system that learns to extract entities from petitions to fill 29 roles of a drug seizure event. Although we have access to a KB covering more than 170 thousand entities and six thousand petitions, such that each entity in the KB is linked to a specific petition, the mentions to an entity within a petition’s text are not annotated. The lack of these annotations brings challenges related to mismatches between entity values in the KB and entity mentions in the documents. Additionally, there are entities with same type or same value. Thus, we propose a distant annotation method to overcome these challenges and automatically label petition documents using the available KB. This annotation method includes a parameter that controls the balance between precision and recall. We also propose a strategy to effectively tune this parameter in order to optimize a given metric. We then train a BERT-based sequence labeling model that learns to identify entity mentions and label them. Our system achieves an F1 score of 78.59 with precision over 82%. We also report ablation studies regarding the distant annotation method.

AB - Most approaches to role-filler entity extraction (REE) rely on large labeled training corpora in which entity mentions are directly annotated in the input document. In this work, we leverage an existing knowledge base (KB) of entities to perform document-level REE from drug seizure petitions. We propose a system that learns to extract entities from petitions to fill 29 roles of a drug seizure event. Although we have access to a KB covering more than 170 thousand entities and six thousand petitions, such that each entity in the KB is linked to a specific petition, the mentions to an entity within a petition’s text are not annotated. The lack of these annotations brings challenges related to mismatches between entity values in the KB and entity mentions in the documents. Additionally, there are entities with same type or same value. Thus, we propose a distant annotation method to overcome these challenges and automatically label petition documents using the available KB. This annotation method includes a parameter that controls the balance between precision and recall. We also propose a strategy to effectively tune this parameter in order to optimize a given metric. We then train a BERT-based sequence labeling model that learns to identify entity mentions and label them. Our system achieves an F1 score of 78.59 with precision over 82%. We also report ablation studies regarding the distant annotation method.

KW - BERT

KW - Entity extraction

KW - NLP

KW - Business informatics

KW - Informatics

UR - http://www.scopus.com/inward/record.url?scp=85127166992&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/88c72c46-972e-34e1-b04c-2b883dd525c5/

U2 - 10.1007/978-3-030-98305-5_16

DO - 10.1007/978-3-030-98305-5_16

M3 - Article in conference proceedings

AN - SCOPUS:85127166992

SN - 978-3-030-98304-8

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 166

EP - 176

BT - Computational Processing of the Portuguese Language

A2 - Pinheiro, Vládia

A2 - Gamallo, Pablo

A2 - Amaro, Raquel

A2 - Scarton, Carolina

A2 - Batista, Fernando

A2 - Silva, Diego

A2 - Magro, Catarina

A2 - Pinto, Hugo

PB - Springer Nature Switzerland AG

CY - Cham

T2 - 15th International Conference on the Computational Processing of Portuguese - PROPOR 2022

Y2 - 21 March 2022 through 23 March 2022

ER -
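
The abstract describes a distant annotation method that labels petition text from KB entries and exposes one parameter to trade precision against recall, but it does not say how matching is done. The following is only a minimal sketch of that general idea, not the authors' method: it assumes string similarity from Python's `difflib` as the matching criterion, and the names `distant_annotate`, `threshold`, and the role names `DEFENDANT` and `DRUG` are hypothetical.

```python
from difflib import SequenceMatcher


def distant_annotate(tokens, kb_entities, threshold=0.8):
    """Produce BIO labels by fuzzy-matching KB values against token spans.

    tokens:      list of document tokens.
    kb_entities: list of (role, value) pairs linked to this document.
    threshold:   minimum similarity for a span to be labeled (assumed knob);
                 raising it favors precision, lowering it favors recall.
    """
    labels = ["O"] * len(tokens)
    for role, value in kb_entities:
        n = len(value.split())  # compare against spans of the same token count
        best_score, best_start = 0.0, None
        for start in range(len(tokens) - n + 1):
            span = " ".join(tokens[start:start + n])
            score = SequenceMatcher(None, span.lower(), value.lower()).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_start is None or best_score < threshold:
            continue  # no span is similar enough: leave the entity unlabeled
        end = best_start + n
        if any(label != "O" for label in labels[best_start:end]):
            continue  # do not overwrite a span already claimed by another entity
        labels[best_start] = "B-" + role
        for i in range(best_start + 1, end):
            labels[i] = "I-" + role
    return labels


# Hypothetical example: KB values lack the diacritics found in the text.
tokens = "O réu João da Silva foi preso com 2 kg de cocaína".split()
kb = [("DEFENDANT", "Joao da Silva"), ("DRUG", "cocaina")]
loose = distant_annotate(tokens, kb, threshold=0.8)
strict = distant_annotate(tokens, kb, threshold=0.95)
```

With the looser threshold the near matches are labeled despite the spelling mismatch, while the stricter setting leaves them unlabeled. This illustrates the kind of precision/recall balance the abstract refers to. The labeled tokens could then serve as training data for a sequence labeling model.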

Other publications by the same author(s)

Data practices in apps from Brazil: What do privacy policies inform us about?

Quadros dos Reis, V., Rabello, M. E. R., Lima, A. C., Jardim, G. P. S., Fernandes, E. R. & Brefeld, U., 10.02.2023, In: Journal on Interactive Systems. 14, 1, p. 1-8 8 p.

Research output: Journal contributions › Journal articles › Research › peer-review

FaST: A linear time stack trace alignment heuristic for crash report deduplication

Rodrigues, I. M., Aloise, D. & Fernandes, E. R., 17.10.2022, The 2022 Mining Software Repositories Conference: MSR 2022, Proceedings; 18-20 May 2022, Virtual; 23-24 May 2022, Pittsburgh, Pennsylvania. New York: Institute of Electrical and Electronics Engineers Inc., p. 549-560 12 p. (Proceedings - IEEE/ACM International Conference on Mining Software Repositories).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Performance predictors for graphics processing units applied to dark-silicon-aware design space exploration

Sonohata, R., Arigoni, D. C. A., Fernandes, E. R., Ribeiro dos Santos, R. & Dessandre Duenha, L., 01.08.2023, In: Concurrency and Computation: Practice and Experience. 35, 17, 16 p., e6877.

Research output: Journal contributions › Journal articles › Research › peer-review

TraceSim: An Alignment Method for Computing Stack Trace Similarity

Rodrigues, I. M., Khvorov, A., Aloise, D., Vasiliev, R., Koznov, D., Fernandes, E. R., Chernishev, G., Luciv, D. & Povarov, N., 01.03.2022, In: Empirical Software Engineering. 27, 2, 41 p., 53.

Research output: Journal contributions › Journal articles › Research › peer-review

Rhetorical Role Identification for Portuguese Legal Documents

Aragy, R., Fernandes, E. R. & Caceres, E. N., 2021, Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29 – December 3, 2021, Proceedings, Part II. Britto, A. & Valdivia Delgado, K. (eds.). Cham: Springer Schweiz, p. 557-571 15 p. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); vol. 13074 LNAI).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

DOI

https://doi.org/10.1007/978-3-030-98305-5_16
Final published version
