Using Wikipedia for Cross-Language Named Entity Recognition

Research output: Contributions to collected editions/works / Article in conference proceedings / Research / peer-review

Authors

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations to the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
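The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionaries, the function names, and the use of exact string matching to find mentions are all assumptions made for illustration. Tokens that match no known entity are labeled None (unknown) rather than "not an entity", which is what makes the resulting corpus partially annotated.

```python
from typing import Dict, List, Optional

# Step 1 (assumed toy data): source-language entries labeled with NERC tags.
SOURCE_TAGS: Dict[str, str] = {
    "Germany": "LOC",
    "Angela Merkel": "PER",
    "Volkswagen": "ORG",
}

# Hypothetical language links: source-language title -> target-language title.
LANGUAGE_LINKS: Dict[str, str] = {
    "Germany": "Deutschland",
    "Angela Merkel": "Angela Merkel",
    "Volkswagen": "Volkswagen",
}


def propagate_tags(source_tags: Dict[str, str],
                   language_links: Dict[str, str]) -> Dict[str, str]:
    """Step 2: carry each tag over to the linked target-language title."""
    return {
        language_links[title]: tag
        for title, tag in source_tags.items()
        if title in language_links
    }


def annotate_mentions(tokens: List[str],
                      target_tags: Dict[str, str]) -> List[Optional[str]]:
    """Step 3: tag mentions by greedy longest match (an assumption here).

    Unmatched tokens stay None, meaning "label unknown", not "no entity".
    """
    entries = {tuple(title.split()): tag for title, tag in target_tags.items()}
    max_len = max((len(key) for key in entries), default=0)
    labels: List[Optional[str]] = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in entries:
                for j in range(i, i + n):
                    labels[j] = entries[span]
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


target_tags = propagate_tags(SOURCE_TAGS, LANGUAGE_LINKS)
sentence = "Angela Merkel besuchte Volkswagen in Deutschland".split()
labels = annotate_mentions(sentence, target_tags)
```

A learner that treats None as a missing label, instead of as a negative example, is the kind of model the abstract's extensions of HMMs and perceptrons are aimed at; the sketch does not attempt to reproduce those extensions.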

Original language: English
Title of host publication: Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers
Editors: Martin Atzmüller, Alvin Chin, Frederik Janssen, Immanuel Schweizer, Christoph Trattner
Number of pages: 25
Publisher: Springer International Publishing
Publication date: 2016
Pages: 1-25
ISBN (print): 978-3-319-29008-9
ISBN (electronic): 978-3-319-29009-6
DOIs
Publication status: Published - 2016
Event: 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014 - Nancy, France
Duration: 15.09.2014 - 15.09.2014
Conference number: 5
https://www.semanticscholar.org/paper/The-Fifth-International-Workshop-on-Mining-and-Qin-Greene/03ed707786c842ce7a36b091457e1452d2723aec
https://www.kde.cs.uni-kassel.de/wp-content/uploads/ws/muse2014/

    Research areas

  • Business informatics - Hidden Markov Model, Target Language, Conditional Random Field, Source Language, Entity Recognition
