Using Wikipedia for Cross-Language Named Entity Recognition

Eraldo R. Fernandes; Ulf Brefeld; Roi Blanco; Jordi Atserias

doi:10.1007/978-3-319-29009-6_1

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Authors

Eraldo R. Fernandes
Ulf Brefeld
Roi Blanco
Jordi Atserias

Professorship for Information Systems, in particular Machine Learning

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations to the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
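
The three-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy dictionaries, the function names `propagate_labels` and `annotate_mentions`, and the use of exact string matching to find mentions are all assumptions made for illustration only. Tokens that match no labeled entity are tagged `None` (unknown) rather than "outside", which reflects the partially annotated nature of the resulting corpus.

```python
# Step 1 (assumed toy data): source-language Wikipedia entries labeled with NERC tags.
source_labels = {"Berlin": "LOC", "Angela Merkel": "PER", "Lüneburg": "LOC"}

# Hypothetical language links: source-language title -> target-language title.
# "Lüneburg" has no link here, so its label cannot be propagated.
language_links = {"Berlin": "Berlim", "Angela Merkel": "Angela Merkel"}


def propagate_labels(labels, links):
    """Step 2: carry each tag over to the linked target-language entry."""
    return {links[title]: tag for title, tag in labels.items() if title in links}


def annotate_mentions(tokens, target_labels):
    """Step 3: tag tokens that spell out a labeled entity title.

    Exact matching is a simplifying assumption. Unmatched tokens stay None,
    meaning "label unknown", so the output is only partially annotated.
    """
    tags = [None] * len(tokens)
    # Try longer titles first so multi-word entities are not split up.
    entries = sorted(target_labels.items(), key=lambda kv: -len(kv[0].split()))
    for title, tag in entries:
        words = title.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            span_is_free = all(t is None for t in tags[i:i + n])
            if tokens[i:i + n] == words and span_is_free:
                tags[i:i + n] = [tag] * n
    return tags


target_labels = propagate_labels(source_labels, language_links)
sentence = "Angela Merkel visitou Berlim ontem".split()
print(annotate_mentions(sentence, target_labels))
# ['PER', 'PER', None, 'LOC', None]
```

A learner trained on such output has to treat `None` as missing information rather than as a negative label, which is the situation the extended HMMs and perceptrons mentioned in the abstract are meant to handle.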

Original language	English
Title of host publication	Big Data Analytics in the Social and Ubiquitous Context : 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers
Editors	Martin Atzmüller, Alvin Chin, Frederik Janssen, Immanuel Schweizer, Christoph Trattner
Number of pages	25
Publisher	Springer International Publishing
Publication date	2016
Pages	1-25
ISBN (print)	978-3-319-29008-9
ISBN (electronic)	978-3-319-29009-6
DOIs	https://doi.org/10.1007/978-3-319-29009-6_1
Publication status	Published - 2016
Event	5th International Workshop on Mining Ubiquitous and Social Environments (MUSE 2014), Nancy, France; Duration: 15.09.2014 → 15.09.2014; Conference number: 5; https://www.semanticscholar.org/paper/The-Fifth-International-Workshop-on-Mining-and-Qin-Greene/03ed707786c842ce7a36b091457e1452d2723aec; https://www.kde.cs.uni-kassel.de/wp-content/uploads/ws/muse2014/

Research areas

Business informatics - Hidden Markov Model, Target Language, Conditional Random Field, Source Language, Entity Recognition

Other publications by the same author(s)

Interactive sequential generative models for team sports

Fassmeyer, D., Cordes, M. & Brefeld, U., 02.2025, In: Machine Learning. 114, 2, 15 p., 38.

Research output: Journal contributions › Journal articles › Research › peer-review

Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items

Bengs, D., Brefeld, U., Kroehne, U. & Zehner, F., 01.09.2025, In: Psychometrika. 90, 4, p. 1346-1367 22 p.

Research output: Journal contributions › Journal articles › Research › peer-review

Machine Learning and Data Mining for Sports Analytics: 11th International Workshop, MLSA 2024, Vilnius, Lithuania, September 9, 2024, Revised Selected Papers

Brefeld, U. (Editor), Davis, J. (Editor), Van Haaren, J. (Editor) & Zimmermann, A. (Editor), 2025, Cham: Springer Verlag. 119 p. (Communications in Computer and Information Science; vol. 2460)

Research output: Books and anthologies › Conference proceedings › Research

Masked autoencoder for multiagent trajectories

Rudolph, Y. & Brefeld, U., 02.2025, In: Machine Learning. 114, 2, 18 p., 44.

Research output: Journal contributions › Journal articles › Research › peer-review

Self-improvement for Computerized Adaptive Testing

Rudolph, Y., Neubauer, K. & Brefeld, U., 2026, Machine Learning and Knowledge Discovery in Databases - Research Track: European Conference, ECML PKDD 2025, Porto, Portugal, September 15–19, 2025, Proceedings. Ribeiro, R. P., Jorge, A. M., Soares, C., Gama, J., Pfahringer, B., Japkowicz, N., Larrañaga, P. & Abreu, P. H. (eds.). Cham: Springer International Publishing, Vol. 2. p. 70-86 17 p. (Lecture Notes in Computer Science; vol. 16014 LNCS).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

DOI

https://doi.org/10.1007/978-3-319-29009-6_1
Final published version
