Using Wikipedia for Cross-Language Named Entity Recognition

Research output: Contributions to collected editions/works; Article in conference proceedings; Research; peer-review

Authors

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations into the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
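The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy dictionaries stand in for Wikipedia entries and language links, the names `SOURCE_LABELS`, `LANGUAGE_LINKS`, `propagate_labels`, and `annotate_mentions` are hypothetical, and mention detection by longest exact string match is an assumption made only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Step 1 (assumed toy data): source-language entries labeled with NERC tags.
SOURCE_LABELS: Dict[str, str] = {
    "Angela Merkel": "PER",
    "Berlin": "LOC",
    "Siemens": "ORG",
}

# Assumed toy language links: source-language title -> target-language title.
LANGUAGE_LINKS: Dict[str, str] = {
    "Angela Merkel": "Angela Merkel",
    "Berlin": "Berlín",
    "Siemens": "Siemens AG",
}


def propagate_labels(
    source_labels: Dict[str, str], language_links: Dict[str, str]
) -> Dict[str, str]:
    """Step 2: carry each tag over to the linked target-language title."""
    return {
        language_links[title]: tag
        for title, tag in source_labels.items()
        if title in language_links
    }


def annotate_mentions(
    tokens: List[str], target_labels: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Step 3: tag mentions of labeled entities; other tokens stay None.

    None means "unknown", not "no entity", so the output is only
    partially annotated.
    """
    # Try longer titles first so multi-word entities win over their parts.
    entries = sorted(
        ((title.split(), tag) for title, tag in target_labels.items()),
        key=lambda entry: -len(entry[0]),
    )
    tags: List[Optional[str]] = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        for words, tag in entries:
            n = len(words)
            if tokens[i:i + n] == words:
                tags[i:i + n] = [tag] * n
                i += n
                break
        else:
            i += 1  # no labeled entity starts at this token
    return list(zip(tokens, tags))


target_labels = propagate_labels(SOURCE_LABELS, LANGUAGE_LINKS)
sentence = "Angela Merkel visitó Berlín con Pedro".split()
annotated = annotate_mentions(sentence, target_labels)
```

In this toy sentence "Pedro" is a person but receives no tag because no labeled entry covers it, which mirrors the abstract's point that the generated corpus is likely to contain unannotated entities.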

Original language: English
Title of host publication: Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers
Editors: Martin Atzmüller, Alvin Chin, Frederik Janssen, Immanuel Schweizer, Christoph Trattner
Number of pages: 25
Publisher: Springer International Publishing
Publication date: 2016
Pages: 1-25
ISBN (print): 978-3-319-29008-9
ISBN (electronic): 978-3-319-29009-6
DOIs
Publication status: Published - 2016
Event: 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014 - Nancy, France
Duration: 15.09.2014 to 15.09.2014
Conference number: 5
https://www.semanticscholar.org/paper/The-Fifth-International-Workshop-on-Mining-and-Qin-Greene/03ed707786c842ce7a36b091457e1452d2723aec
https://www.kde.cs.uni-kassel.de/wp-content/uploads/ws/muse2014/

    Research areas

  • Business informatics - Hidden Markov Model, Target Language, Conditional Random Field, Source Language, Entity Recognition
