A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I. ed. / Ricardo Cerri; Ronaldo C. Prati. Cham: Springer Nature Switzerland AG, 2020. p. 648-662 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12319 LNAI).
RIS
TY - CHAP
T1 - A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese
AU - Bonifacio, Luiz Henrique
AU - Vilela, Paulo Arantes
AU - Lobato, Gustavo Rocha
AU - Fernandes, Eraldo Rezende
N1 - Conference code: 9
PY - 2020
Y1 - 2020
AB - Deep language models, like ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general-domain text and later trained with supervision on downstream tasks. An optional step consists of finetuning the language model on a large intradomain corpus of unlabeled text before training it on the final task. This step is not well explored in the current literature. In this work, we investigate its impact on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement in performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether these improvements were really due to domain similarity or simply due to more training data. The results indicate that finetuning on a legal-domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese-language NER corpus for the legal domain.
KW - Deep learning
KW - Named entity recognition
KW - Natural language processing
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=85094121387&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-61377-8_46
DO - 10.1007/978-3-030-61377-8_46
M3 - Article in conference proceedings
AN - SCOPUS:85094121387
SN - 978-3-030-61376-1
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 648
EP - 662
BT - Intelligent Systems
A2 - Cerri, Ricardo
A2 - Prati, Ronaldo C.
PB - Springer Nature Switzerland AG
CY - Cham
T2 - Brazilian Conference on Intelligent Systems - BRACIS 2020
Y2 - 20 October 2020 through 23 October 2020
ER -