A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Standard

A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. / Bonifacio, Luiz Henrique; Vilela, Paulo Arantes; Lobato, Gustavo Rocha et al.
Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I. ed. / Ricardo Cerri; Ronaldo C. Prati. Cham: Springer Nature Switzerland AG, 2020. p. 648-662 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12319 LNAI).


Harvard

Bonifacio, LH, Vilela, PA, Lobato, GR & Fernandes, ER 2020, A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. in R Cerri & RC Prati (eds), Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12319 LNAI, Springer Nature Switzerland AG, Cham, pp. 648-662, Brazilian Conference on Intelligent Systems - BRACIS 2020, Rio Grande, Brazil, 20.10.20. https://doi.org/10.1007/978-3-030-61377-8_46

APA

Bonifacio, L. H., Vilela, P. A., Lobato, G. R., & Fernandes, E. R. (2020). A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. In R. Cerri, & R. C. Prati (Eds.), Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I (pp. 648-662). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12319 LNAI). Springer Nature Switzerland AG. https://doi.org/10.1007/978-3-030-61377-8_46

Vancouver

Bonifacio LH, Vilela PA, Lobato GR, Fernandes ER. A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. In Cerri R, Prati RC, editors, Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I. Cham: Springer Nature Switzerland AG. 2020. p. 648-662. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12319 LNAI). doi: 10.1007/978-3-030-61377-8_46

Bibtex

@inbook{aefc1e9b06a54d238646d433c41dcb30,
title = "A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese",
abstract = "Deep language models, like ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general domain text and later supervisedly trained on downstream tasks. An optional step consists of finetuning the language model on a large intradomain corpus of unlabeled text, before training it on the final task. This aspect is not well explored in the current literature. In this work, we investigate the impact of this step on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement on performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether the aforementioned improvements were really due to domain similarity or simply due to more training data. The achieved results also indicate that finetuning on a legal domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese language NER corpus for the legal domain.",
keywords = "Deep learning, Named entity recognition, Natural language processing, Informatics, Business informatics",
author = "Bonifacio, {Luiz Henrique} and Vilela, {Paulo Arantes} and Lobato, {Gustavo Rocha} and Fernandes, {Eraldo Rezende}",
year = "2020",
doi = "10.1007/978-3-030-61377-8_46",
language = "English",
isbn = "978-3-030-61376-1",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
volume = "12319 LNAI",
publisher = "Springer Nature Switzerland AG",
pages = "648--662",
editor = "Ricardo Cerri and Prati, {Ronaldo C.}",
booktitle = "Intelligent Systems",
address = "Switzerland",
note = "Brazilian Conference on Intelligent Systems - BRACIS 2020 ; Conference date: 20-10-2020 Through 23-10-2020",
url = "http://www2.sbc.org.br/bracis2020/#:~:text=The%209th%20Brazilian%20Conference%20on,%2C%2020%20to%2023%2C%202020.",

}

RIS

TY - CHAP

T1 - A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

AU - Bonifacio, Luiz Henrique

AU - Vilela, Paulo Arantes

AU - Lobato, Gustavo Rocha

AU - Fernandes, Eraldo Rezende

N1 - Conference code: 9

PY - 2020

Y1 - 2020

N2 - Deep language models, like ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general domain text and later supervisedly trained on downstream tasks. An optional step consists of finetuning the language model on a large intradomain corpus of unlabeled text, before training it on the final task. This aspect is not well explored in the current literature. In this work, we investigate the impact of this step on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement on performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether the aforementioned improvements were really due to domain similarity or simply due to more training data. The achieved results also indicate that finetuning on a legal domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese language NER corpus for the legal domain.

AB - Deep language models, like ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general domain text and later supervisedly trained on downstream tasks. An optional step consists of finetuning the language model on a large intradomain corpus of unlabeled text, before training it on the final task. This aspect is not well explored in the current literature. In this work, we investigate the impact of this step on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement on performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether the aforementioned improvements were really due to domain similarity or simply due to more training data. The achieved results also indicate that finetuning on a legal domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese language NER corpus for the legal domain.

KW - Deep learning

KW - Named entity recognition

KW - Natural language processing

KW - Informatics

KW - Business informatics

UR - http://www.scopus.com/inward/record.url?scp=85094121387&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-61377-8_46

DO - 10.1007/978-3-030-61377-8_46

M3 - Article in conference proceedings

AN - SCOPUS:85094121387

SN - 978-3-030-61376-1

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 648

EP - 662

BT - Intelligent Systems

A2 - Cerri, Ricardo

A2 - Prati, Ronaldo C.

PB - Springer Nature Switzerland AG

CY - Cham

T2 - Brazilian Conference on Intelligent Systems - BRACIS 2020

Y2 - 20 October 2020 through 23 October 2020

ER -
