A machine learning approach to Portuguese clause identification
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Standard
Computational Processing of the Portuguese Language: 9th International Conference, PROPOR 2010, Porto Alegre, RS, Brazil, April 27-30, 2010. Proceedings. ed. / Thiago Alexandre Salgueiro Pardo; Antonio Branco; Aldebaro Klautau; Renata Vieira; Vera Lucia Strube de Lima. Berlin, Heidelberg: Springer, 2010. p. 55-64 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6001 LNAI).
RIS
TY - CHAP
T1 - A machine learning approach to Portuguese clause identification
AU - Fernandes, Eraldo R.
AU - Dos Santos, Cícero N.
AU - Milidiú, Ruy L.
N1 - Conference code: 9
PY - 2010
Y1 - 2010
N2 - In this work, we apply and evaluate a machine-learning-based system to Portuguese clause identification. To the best of our knowledge, this is the first machine-learning-based approach to this task. The proposed system is based on Entropy Guided Transformation Learning. In order to train and evaluate the proposed system, we derive a clause-annotated corpus from the Bosque corpus of the Floresta Sintá(c)tica Project - a European and Brazilian Portuguese treebank. We add part-of-speech (POS) tags to the derived corpus by using an automatic state-of-the-art tagger. Additionally, we use a simple heuristic to derive a phrase-chunk-like (PCL) feature from phrases in the Bosque corpus. We train an extractor for this sub-task and use it to automatically include the PCL feature in the derived clause corpus. We use POS and PCL tags as input features in the proposed clause identifier. This system achieves an Fβ=1 of 73.90 when using the golden values of the PCL feature. When the automatic values are used, the system obtains an Fβ=1 of 69.31. These are promising results for a first machine learning approach to Portuguese clause identification. Moreover, these results are achieved using a very simple PCL feature, which is generated by a PCL extractor developed with very little modeling effort.
KW - Informatics
KW - Machine Learning Approach
KW - Training Corpus
KW - Shared Task
KW - Human Language Technology
KW - Corpus Format
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=78650284958&partnerID=8YFLogxK
UR - https://d-nb.info/1000569276
U2 - 10.1007/978-3-642-12320-7_8
DO - 10.1007/978-3-642-12320-7_8
M3 - Article in conference proceedings
AN - SCOPUS:78650284958
SN - 3-642-12319-8
SN - 978-3-642-12319-1
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 55
EP - 64
BT - Computational Processing of the Portuguese Language
A2 - Pardo, Thiago Alexandre Salgueiro
A2 - Branco, Antonio
A2 - Klautau, Aldebaro
A2 - Vieira, Renata
A2 - de Lima, Vera Lucia Strube
PB - Springer
CY - Berlin, Heidelberg
T2 - International Conference on Computational Processing of the Portuguese Language
Y2 - 27 April 2010 through 30 April 2010
ER -