Using Wikipedia for Cross-Language Named Entity Recognition

Publication: Contributions to collected editions › Articles in conference proceedings › Research › peer-reviewed

Standard

Using Wikipedia for Cross-Language Named Entity Recognition. / Fernandes, Eraldo R.; Brefeld, Ulf; Blanco, Roi et al.
Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. Ed. / Martin Atzmüller; Alvin Chin; Frederik Janssen; Immanuel Schweizer; Christoph Trattner. Springer International Publishing, 2016. pp. 1-25 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9546).

Harvard

Fernandes, ER, Brefeld, U, Blanco, R & Atserias, J 2016, Using Wikipedia for Cross-Language Named Entity Recognition. in M Atzmüller, A Chin, F Janssen, I Schweizer & C Trattner (eds), Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9546, Springer International Publishing, pp. 1-25, 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014, Nancy, France, 15 September 2014. https://doi.org/10.1007/978-3-319-29009-6_1

APA

Fernandes, E. R., Brefeld, U., Blanco, R., & Atserias, J. (2016). Using Wikipedia for Cross-Language Named Entity Recognition. In M. Atzmüller, A. Chin, F. Janssen, I. Schweizer, & C. Trattner (Eds.), Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers (pp. 1-25). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9546). Springer International Publishing. https://doi.org/10.1007/978-3-319-29009-6_1

Vancouver

Fernandes ER, Brefeld U, Blanco R, Atserias J. Using Wikipedia for Cross-Language Named Entity Recognition. In Atzmüller M, Chin A, Janssen F, Schweizer I, Trattner C, editors, Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, MSM 2014, 5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning for Urban Sensor Data, SenseML 2014, Revised Selected Papers. Springer International Publishing. 2016. p. 1-25. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-29009-6_1

Bibtex

@inbook{6e4b3c84791249f2ad0e98fd7e464d1c,
title = "Using Wikipedia for Cross-Language Named Entity Recognition",
abstract = "Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.",
keywords = "Business informatics, Hidden Markov Model, Target Language, Conditional Random Field, Source Language, Entity Recognition",
author = "Fernandes, {Eraldo R.} and Ulf Brefeld and Roi Blanco and Jordi Atserias",
year = "2016",
doi = "10.1007/978-3-319-29009-6_1",
language = "English",
isbn = "978-3-319-29008-9",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer International Publishing",
pages = "1--25",
editor = "Martin Atzm{\"u}ller and Alvin Chin and Frederik Janssen and Immanuel Schweizer and Christoph Trattner",
booktitle = "Big Data Analytics in the Social and Ubiquitous Context",
address = "Switzerland",
note = "5th International Workshop on Mining Ubiquitous and Social Environments, MUSE 2014; Conference date: 15-09-2014 through 15-09-2014",
url = "https://www.semanticscholar.org/paper/The-Fifth-International-Workshop-on-Mining-and-Qin-Greene/03ed707786c842ce7a36b091457e1452d2723aec, https://www.kde.cs.uni-kassel.de/wp-content/uploads/ws/muse2014/",

}

RIS

TY - CHAP

T1 - Using Wikipedia for Cross-Language Named Entity Recognition

AU - Fernandes, Eraldo R.

AU - Brefeld, Ulf

AU - Blanco, Roi

AU - Atserias, Jordi

N1 - Conference code: 5

PY - 2016

Y1 - 2016

AB - Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. Firstly, Wikipedia entries in the source language are labeled with the NERC tag set. Secondly, Wikipedia language links are exploited to propagate the annotations in the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.

KW - Business informatics

KW - Hidden Markov Model

KW - Target Language

KW - Conditional Random Field

KW - Source Language

KW - Entity Recognition

UR - http://www.scopus.com/inward/record.url?scp=84955265040&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-29009-6_1

DO - 10.1007/978-3-319-29009-6_1

M3 - Article in conference proceedings

SN - 978-3-319-29008-9

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 1

EP - 25

BT - Big Data Analytics in the Social and Ubiquitous Context

A2 - Atzmüller, Martin

A2 - Chin, Alvin

A2 - Janssen, Frederik

A2 - Schweizer, Immanuel

A2 - Trattner, Christoph

PB - Springer International Publishing

T2 - 5th International Workshop on Mining Ubiquitous and Social Environments - MUSE 2014

Y2 - 15 September 2014 through 15 September 2014

ER -
