QALD-9-ES: A Spanish Dataset for Question Answering Systems
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Standard
Knowledge Graphs: Semantics, Machine Learning, and Languages: Proceedings of the 19th International Conference on Semantic Systems, 20-22 September 2023, Leipzig, Germany. ed. / Maribel Acosta; Silvio Peroni; Sahar Vahdati; Anna Lisa Gentile; Tassilo Pellegrini; Jan-Christoph Kalo. Amsterdam: IOS Press BV, 2023. p. 38-52 (Studies on the Semantic Web; Vol. 56).
RIS
TY - CHAP
T1 - QALD-9-ES: A Spanish Dataset for Question Answering Systems
AU - Soruco, Javier
AU - Collarana, Diego
AU - Both, Andreas
AU - Usbeck, Ricardo
N1 - Conference code: 19
PY - 2023/9/11
Y1 - 2023/9/11
N2 - Knowledge Graph Question Answering (KGQA) systems enable access to semantic information for any user who can compose a question in natural language. KGQA systems are now a core component of many industrial applications, including chatbots and conversational search applications. Although distinct worldwide cultures speak different languages, the number of languages covered by KGQA systems and its resources is mainly limited to English. To implement KGQA systems worldwide, we need to expand the current KGQA resources to languages other than English. Taking into account the recent popularity that Large-Scale Language Models are receiving, we believe that providing quality resources is key to the development of future pipelines. One of these resources is the datasets used to train and test KGQA systems. Among the few multilingual KGQA datasets available, only one covers Spanish, i.e., QALD-9. We reviewed the Spanish translations in the QALD-9 dataset and confirmed several issues that may affect the KGQA system’s quality. Taking this into account, we created new Spanish translations for this dataset and reviewed them manually with the help of native speakers. This dataset provides newly created, high-quality translations for QALD-9; we call this extension QALD-9-ES. We merged these translations into the QALD-9-plus dataset, which provides trustworthy native translations for QALD-9 in nine languages, intending to create one complete source of high-quality translations. We compared the new translations with the QALD-9 original ones using language-agnostic quantitative text analysis measures and found improvements in the results of the new translations. Finally, we compared both translations using the GERBIL QA benchmark framework using a KGQA system that supports Spanish. Although the question-answering scores only improved slightly, we believe that improving the quality of the existing translations will result in better KGQA systems and therefore increase the applicability of KGQA w.r.t. the Spanish language domain.
AB - Knowledge Graph Question Answering (KGQA) systems enable access to semantic information for any user who can compose a question in natural language. KGQA systems are now a core component of many industrial applications, including chatbots and conversational search applications. Although distinct worldwide cultures speak different languages, the number of languages covered by KGQA systems and its resources is mainly limited to English. To implement KGQA systems worldwide, we need to expand the current KGQA resources to languages other than English. Taking into account the recent popularity that Large-Scale Language Models are receiving, we believe that providing quality resources is key to the development of future pipelines. One of these resources is the datasets used to train and test KGQA systems. Among the few multilingual KGQA datasets available, only one covers Spanish, i.e., QALD-9. We reviewed the Spanish translations in the QALD-9 dataset and confirmed several issues that may affect the KGQA system’s quality. Taking this into account, we created new Spanish translations for this dataset and reviewed them manually with the help of native speakers. This dataset provides newly created, high-quality translations for QALD-9; we call this extension QALD-9-ES. We merged these translations into the QALD-9-plus dataset, which provides trustworthy native translations for QALD-9 in nine languages, intending to create one complete source of high-quality translations. We compared the new translations with the QALD-9 original ones using language-agnostic quantitative text analysis measures and found improvements in the results of the new translations. Finally, we compared both translations using the GERBIL QA benchmark framework using a KGQA system that supports Spanish. Although the question-answering scores only improved slightly, we believe that improving the quality of the existing translations will result in better KGQA systems and therefore increase the applicability of KGQA w.r.t. the Spanish language domain.
KW - Business informatics
KW - Knowledge Graphs
KW - Informatics
KW - Question Answering
KW - Dataset
UR - https://www.iospress.com/catalog/books/knowledge-graphs-semantics-machine-learning-and-languages
UR - https://www.mendeley.com/catalogue/863f1283-9035-351a-bc12-b1ae046d0649/
U2 - 10.3233/SSW230004
DO - 10.3233/SSW230004
M3 - Article in conference proceedings
SN - 978-1-64368-424-6
T3 - Studies on the Semantic Web
SP - 38
EP - 52
BT - Knowledge Graphs: Semantics, Machine Learning, and Languages
A2 - Acosta, Maribel
A2 - Peroni, Silvio
A2 - Vahdati, Sahar
A2 - Gentile, Anna Lisa
A2 - Pellegrini, Tassilo
A2 - Kalo, Jan-Christoph
PB - IOS Press BV
CY - Amsterdam
T2 - 19th International Conference on Semantic Systems
Y2 - 20 September 2023 through 22 September 2023
ER -