Low Resource Question Answering: An Amharic Benchmarking Dataset
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
The Fifth Workshop on Resources for African Indigenous Languages @LREC-COLING-2024 (RAIL): Workshop Proceedings. ed. / Rooweither Mabuya; Muzi Matfunjwa; Mmasibidi Setaka; Menno van Zaanen. Paris: European Language Resources Association (ELRA), 2024. pp. 124-132 (LREC proceedings), (International conference on computational linguistics).
RIS
TY - CHAP
T1 - Low Resource Question Answering: An Amharic Benchmarking Dataset
T2 - 5th Workshop on Resources for African Indigenous Languages - RAIL 2024
AU - Taffa, Tilahun Abedissa
AU - Assabie, Yaregal
AU - Usbeck, Ricardo
N1 - Conference code: 5
PY - 2024
Y1 - 2024
N2 - Question Answering (QA) systems return concise answers or answer lists from natural language text, using a given context document. Many resources go into curating QA datasets to advance the development of robust QA models. There is a surge in QA datasets for languages such as English; the situation is different for low-resource languages like Amharic. Indeed, there is no published or publicly available Amharic QA dataset. Hence, to foster further research in low-resource QA, we present the first publicly available benchmarking Amharic Question Answering Dataset (Amh-QuAD). We crowdsource 2,628 question-answer pairs from over 378 Amharic Wikipedia articles. Using the training set, we fine-tune an XLM-R-based language model and introduce a new reader model. Leveraging our newly fine-tuned reader, we run a baseline model to spark open-domain Amharic QA research interest. The best-performing baseline QA achieves F-scores of 80.3 and 81.34 in the retriever-reader and reading comprehension settings, respectively.
KW - Amh-QuAD
KW - Amharic Question Answering Dataset
KW - Amharic Reading Comprehension
KW - Low Resource Question Answering
KW - Informatics
UR - http://www.scopus.com/inward/record.url?scp=85195211713&partnerID=8YFLogxK
UR - https://aclanthology.org/2024.rail-1.0.pdf
UR - https://aclanthology.org/events/coling-2024/#2024rail-1
UR - https://www.mendeley.com/catalogue/3de6d397-698d-3f23-bb7f-847fca82ca94/
M3 - Article in conference proceedings
AN - SCOPUS:85195211713
SN - 9782493814401
T3 - LREC proceedings
SP - 124
EP - 132
BT - The Fifth Workshop on Resources for African Indigenous Languages @LREC-COLING-2024 (RAIL)
A2 - Mabuya, Rooweither
A2 - Matfunjwa, Muzi
A2 - Setaka, Mmasibidi
A2 - van Zaanen, Menno
PB - European Language Resources Association (ELRA)
CY - Paris
Y2 - 25 May 2024
ER -