Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays
Publication: Contributions to journals › Journal articles › Research › peer-reviewed
in: System, Volume 133, 103784, 10.2025.
RIS
TY - JOUR
T1 - Automated scoring in the era of artificial intelligence
T2 - An empirical study with Turkish essays
AU - Aydın, Burak
AU - Kışla, Tarık
AU - Elmas, Nursel Tan
AU - Bulut, Okan
N1 - Publisher Copyright: © 2025 The Authors
PY - 2025/10
Y1 - 2025/10
AB - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed strong alignment between human and AI scores, with a Quadratic Weighted Kappa of 0.72, a Pearson correlation of 0.73, and an overlap measure of 83.5%. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, including possible sources of disagreement between human and AI scores. Since the conclusions are drawn from a single writing task with a single human rater, future research should explore diverse inputs and multiple raters.
KW - Automated scoring
KW - Large language models
KW - Multilevel models
KW - Rater reliability
KW - Turkish essays
KW - Zero-shot with rubric
KW - Educational science
UR - http://www.scopus.com/inward/record.url?scp=105010969717&partnerID=8YFLogxK
U2 - 10.1016/j.system.2025.103784
DO - 10.1016/j.system.2025.103784
M3 - Journal articles
AN - SCOPUS:105010969717
VL - 133
JO - System
JF - System
SN - 0346-251X
M1 - 103784
ER -
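
The abstract describes zero-shot prompting of GPT-4o with a CEFR-based rubric. As an illustration only, here is a minimal Python sketch of what such a scoring call might look like; the rubric wording, dimension names, and score range are assumptions, and the study's custom-built interface and actual prompt are not reproduced in this record.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder rubric: the paper's CEFR-based rubric and its six dimension
# names are not quoted in this record, so these labels are illustrative.
RUBRIC = (
    "Score the Turkish essay from 0 to 5 on each of six dimensions "
    "(content, organization, vocabulary, grammar, mechanics, task "
    "achievement). Reply with one line per dimension as '<dimension>: <score>'."
)

def score_essay(essay_text: str) -> str:
    # Zero-shot: the rubric goes in the system prompt, the essay as the
    # user message; temperature 0 keeps the scoring as stable as possible.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content

The reported agreement statistics (Quadratic Weighted Kappa, Pearson correlation, overlap) can be computed with standard scikit-learn and SciPy calls. The sketch below uses toy scores, not study data, and assumes exact-or-adjacent agreement for the overlap measure; the paper's precise definition of "overlap" is not given in this record.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 4, 2, 5, 3, 4])  # toy human-rater scores
ai = np.array([3, 4, 3, 5, 2, 4])     # toy AI scores

qwk = cohen_kappa_score(human, ai, weights="quadratic")
r, _ = pearsonr(human, ai)
overlap = np.mean(np.abs(human - ai) <= 1)  # exact or adjacent agreement

print(f"QWK={qwk:.2f}, Pearson r={r:.2f}, overlap={overlap:.1%}")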