A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

Luiz Henrique Bonifacio; Paulo Arantes Vilela; Gustavo Rocha Lobato; Eraldo Rezende Fernandes

doi:10.1007/978-3-030-61377-8_46

Publication: Contributions to collected editions › Articles in conference proceedings › Research › peer-reviewed

Deep language models such as ELMo, BERT and GPT have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general-domain text and then trained with supervision on downstream tasks. An optional intermediate step is to finetune the language model on a large intradomain corpus of unlabeled text before training it on the final task. This step is not well explored in the current literature. In this work, we investigate its impact on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement in performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks in order to understand whether these improvements were really due to domain similarity or simply due to more training data. The results also indicate that finetuning on a legal-domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese-language NER corpus for the legal domain.

Original language	English
Title	Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I
Editors	Ricardo Cerri, Ronaldo C. Prati
Number of pages	15
Place of publication	Cham
Publisher	Springer Nature Switzerland AG
Publication date	2020
Pages	648-662
ISBN (print)	978-3-030-61376-1
ISBN (electronic)	978-3-030-61377-8
DOIs	https://doi.org/10.1007/978-3-030-61377-8_46
Publication status	Published - 2020
Externally published	Yes
Event	Brazilian Conference on Intelligent Systems - BRACIS 2020 - Rio Grande, Brazil. Duration: 20.10.2020 → 23.10.2020. Conference number: 9. http://www2.sbc.org.br/bracis2020/#:~:text=The%209th%20Brazilian%20Conference%20on,%2C%2020%20to%2023%2C%202020.

Research areas

Computer science
Business informatics

Further publications by these authors

Data practices in apps from Brazil: What do privacy policies inform us about?

Quadros dos Reis, V., Rabello, M. E. R., Lima, A. C., Jardim, G. P. S., Fernandes, E. R. & Brefeld, U., 10.02.2023, in: Journal on Interactive Systems. 14, 1, pp. 1-8, 8 pp.

Publication: Contributions to journals › Journal articles › Research › peer-reviewed

Entity Extraction from Portuguese Legal Documents Using Distant Supervision

Navarezi, L. M., Sakiyama, K., Rodrigues, L. S., Robaldo, C. M. O., Lobato, G. R., Vilela, P. A., Matsubara, E. T. & Fernandes, E. R., 2022, Computational Processing of the Portuguese Language: 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C. & Pinto, H. (eds.). Cham: Springer Nature Switzerland AG, pp. 166-176, 11 pp. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13208 LNAI).

Publication: Contributions to collected editions › Articles in conference proceedings › Research › peer-reviewed

FaST: A linear time stack trace alignment heuristic for crash report deduplication

Rodrigues, I. M., Aloise, D. & Fernandes, E. R., 23.05.2022, The 2022 Mining Software Repositories Conference: MSR 2022, Proceedings; 18-20 May 2022, Virtual; 23-24 May 2022, Pittsburgh, Pennsylvania. New York: Institute of Electrical and Electronics Engineers Inc., pp. 549-560, 12 pp. (Proceedings - IEEE/ACM International Conference on Mining Software Repositories).

Publication: Contributions to collected editions › Articles in conference proceedings › Research › peer-reviewed

Performance predictors for graphics processing units applied to dark-silicon-aware design space exploration

Sonohata, R., Arigoni, D. C. A., Fernandes, E. R., Ribeiro dos Santos, R. & Dessandre Duenha, L., 01.08.2023, in: Concurrency and Computation: Practice and Experience. 35, 17, 16 pp., e6877.

Publication: Contributions to journals › Journal articles › Research › peer-reviewed

TraceSim: An Alignment Method for Computing Stack Trace Similarity

Rodrigues, I. M., Khvorov, A., Aloise, D., Vasiliev, R., Koznov, D., Fernandes, E. R., Chernishev, G., Luciv, D. & Povarov, N., 01.03.2022, in: Empirical Software Engineering. 27, 2, 41 pp., 53.

Publication: Contributions to journals › Journal articles › Research › peer-reviewed

DOI

https://doi.org/10.1007/978-3-030-61377-8_46
Final published version