Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. / Aydın, Burak; Kışla, Tarık; Elmas, Nursel Tan et al.
In: System, Vol. 133, 103784, 10.2025.

Vancouver

Aydın B, Kışla T, Elmas NT, Bulut O. Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. System. 2025 Oct;133:103784. doi: 10.1016/j.system.2025.103784

Bibtex

@article{f284135ec3a8446cbe6e35f049a4a7f7,
title = "Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays",
abstract = "Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.",
keywords = "Automated scoring, Large language models, Multilevel models, Rater reliability, Turkish essays, Zero-shot with rubric, Educational science",
author = "Burak Aydın and Tarık Kı{\c s}la and Elmas, {Nursel Tan} and Okan Bulut",
note = "Publisher Copyright: {\textcopyright} 2025 The Authors",
year = "2025",
month = oct,
doi = "10.1016/j.system.2025.103784",
language = "English",
volume = "133",
journal = "System",
issn = "0346-251X",
publisher = "Elsevier Ltd",

}

RIS

TY - JOUR

T1 - Automated scoring in the era of artificial intelligence

T2 - An empirical study with Turkish essays

AU - Aydın, Burak

AU - Kışla, Tarık

AU - Elmas, Nursel Tan

AU - Bulut, Okan

N1 - Publisher Copyright: © 2025 The Authors

PY - 2025/10

Y1 - 2025/10

AB - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

KW - Automated scoring

KW - Large language models

KW - Multilevel models

KW - Rater reliability

KW - Turkish essays

KW - Zero-shot with rubric

KW - Educational science

UR - http://www.scopus.com/inward/record.url?scp=105010969717&partnerID=8YFLogxK

U2 - 10.1016/j.system.2025.103784

DO - 10.1016/j.system.2025.103784

M3 - Journal articles

AN - SCOPUS:105010969717

VL - 133

JO - System

JF - System

SN - 0346-251X

M1 - 103784

ER -