Using Wikipedia for Cross-Language Named Entity Recognition

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Standard

Using Wikipedia for Cross-Language Named Entity Recognition. / Fernandes, Eraldo R.; Brefeld, Ulf; Blanco, Roi et al.
Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. ed. / Martin Atzmüller; Alvin Chin; Frederik Janssen; Immanuel Schweizer; Christoph Trattner. Springer International Publishing AG, 2016. p. 1-25 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9546).

Harvard

Fernandes, ER, Brefeld, U, Blanco, R & Atserias, J 2016, Using Wikipedia for Cross-Language Named Entity Recognition. in M Atzmüller, A Chin, F Janssen, I Schweizer & C Trattner (eds), Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9546, Springer International Publishing AG, pp. 1-25, 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014, Nancy, France, 15.09.14. https://doi.org/10.1007/978-3-319-29009-6_1

APA

Fernandes, E. R., Brefeld, U., Blanco, R., & Atserias, J. (2016). Using Wikipedia for Cross-Language Named Entity Recognition. In M. Atzmüller, A. Chin, F. Janssen, I. Schweizer, & C. Trattner (Eds.), Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers (pp. 1-25). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9546). Springer International Publishing AG. https://doi.org/10.1007/978-3-319-29009-6_1

Vancouver

Fernandes ER, Brefeld U, Blanco R, Atserias J. Using Wikipedia for Cross-Language Named Entity Recognition. In Atzmüller M, Chin A, Janssen F, Schweizer I, Trattner C, editors, Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. Springer International Publishing AG. 2016. p. 1-25. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-29009-6_1

Bibtex

@inbook{6e4b3c84791249f2ad0e98fd7e464d1c,
title = "Using Wikipedia for Cross-Language Named Entity Recognition",
abstract = "Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.",
keywords = "Business informatics, Hidden Markov Model, Target Language, Conditional Random Field, Source Language, Entity Recognition",
author = "Fernandes, {Eraldo R.} and Ulf Brefeld and Roi Blanco and Jordi Atserias",
year = "2016",
doi = "10.1007/978-3-319-29009-6_1",
language = "English",
isbn = "978-3-319-29008-9",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer International Publishing AG",
pages = "1--25",
editor = "Martin Atzm{\"u}ller and Alvin Chin and Frederik Janssen and Immanuel Schweizer and Christoph Trattner",
booktitle = "Big Data Analytics in the Social and Ubiquitous Context",
address = "Switzerland",
note = "5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014; Conference date: 15-09-2014 through 15-09-2014",
url = "https://www.semanticscholar.org/paper/The-Fifth-International-Workshop-on-Mining-and-Qin-Greene/03ed707786c842ce7a36b091457e1452d2723aec, https://www.kde.cs.uni-kassel.de/wp-content/uploads/ws/muse2014/",

}

RIS

TY - CHAP

T1 - Using Wikipedia for Cross-Language Named Entity Recognition

AU - Fernandes, Eraldo R.

AU - Brefeld, Ulf

AU - Blanco, Roi

AU - Atserias, Jordi

N1 - Conference code: 5

PY - 2016

Y1 - 2016

AB - Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.

KW - Business informatics

KW - Hidden Markov Model

KW - Target Language

KW - Conditional Random Field

KW - Source Language

KW - Entity Recognition

UR - http://www.scopus.com/inward/record.url?scp=84955265040&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-29009-6_1

DO - 10.1007/978-3-319-29009-6_1

M3 - Article in conference proceedings

SN - 978-3-319-29008-9

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 1

EP - 25

BT - Big Data Analytics in the Social and Ubiquitous Context

A2 - Atzmüller, Martin

A2 - Chin, Alvin

A2 - Janssen, Frederik

A2 - Schweizer, Immanuel

A2 - Trattner, Christoph

PB - Springer International Publishing AG

T2 - 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014

Y2 - 15 September 2014 through 15 September 2014

ER -
