Entity Extraction from Portuguese Legal Documents Using Distant Supervision
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Standard
Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. ed. / Vládia Pinheiro; Pablo Gamallo; Raquel Amaro; Carolina Scarton; Fernando Batista; Diego Silva; Catarina Magro; Hugo Pinto. Cham: Springer Nature Switzerland AG, 2022. p. 166-176 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13208 LNAI).
RIS
TY - CHAP
T1 - Entity Extraction from Portuguese Legal Documents Using Distant Supervision
AU - Navarezi, Lucas M.
AU - Sakiyama, Kenzo
AU - Rodrigues, Lucas S.
AU - Robaldo, Caio M.O.
AU - Lobato, Gustavo R.
AU - Vilela, Paulo A.
AU - Matsubara, Edson T.
AU - Fernandes, Eraldo R.
PY - 2022
Y1 - 2022
AB - Most approaches to role-filler entity extraction (REE) rely on large labeled training corpora in which entity mentions are directly annotated in the input document. In this work, we leverage an existing knowledge base (KB) of entities to perform document-level REE from drug seizure petitions. We propose a system that learns to extract entities from petitions to fill 29 roles of a drug seizure event. Although we have access to a KB covering more than 170 thousand entities and six thousand petitions, such that each entity in the KB is linked to a specific petition, mentions of an entity within a petition’s text are not annotated. The lack of these annotations brings challenges related to mismatches between entity values in the KB and entity mentions in the documents. Additionally, there are entities with the same type or the same value. Thus, we propose a distant annotation method to overcome these challenges and automatically label petition documents using the available KB. This annotation method includes a parameter that controls the balance between precision and recall. We also propose a strategy to effectively tune this parameter in order to optimize a given metric. We then train a BERT-based sequence labeling model that learns to identify entity mentions and label them. Our system achieves an F1 score of 78.59 with precision over 82%. We also report ablation studies regarding the distant annotation method.
KW - BERT
KW - Entity extraction
KW - NLP
KW - Business informatics
KW - Informatics
UR - http://www.scopus.com/inward/record.url?scp=85127166992&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/88c72c46-972e-34e1-b04c-2b883dd525c5/
U2 - 10.1007/978-3-030-98305-5_16
DO - 10.1007/978-3-030-98305-5_16
M3 - Article in conference proceedings
AN - SCOPUS:85127166992
SN - 978-3-030-98304-8
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 166
EP - 176
BT - Computational Processing of the Portuguese Language
A2 - Pinheiro, Vládia
A2 - Gamallo, Pablo
A2 - Amaro, Raquel
A2 - Scarton, Carolina
A2 - Batista, Fernando
A2 - Silva, Diego
A2 - Magro, Catarina
A2 - Pinto, Hugo
PB - Springer Nature Switzerland AG
CY - Cham
T2 - 15th International Conference on the Computational Processing of Portuguese - PROPOR 2022
Y2 - 21 March 2022 through 23 March 2022
ER -