Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. / Aydın, Burak; Kışla, Tarık; Elmas, Nursel Tan et al.
In: System, Vol. 133, 103784, 10.2025.


Vancouver

Aydın B, Kışla T, Elmas NT, Bulut O. Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. System. 2025 Oct;133:103784. doi: 10.1016/j.system.2025.103784

BibTeX

@article{f284135ec3a8446cbe6e35f049a4a7f7,
title = "Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays",
abstract = "Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.",
keywords = "Automated scoring, Large language models, Multilevel models, Rater reliability, Turkish essays, Zero-shot with rubric, Educational science",
author = "Burak Aydın and Tarık Kı{\c s}la and Elmas, {Nursel Tan} and Okan Bulut",
note = "Publisher Copyright: {\textcopyright} 2025 The Authors",
year = "2025",
month = oct,
doi = "10.1016/j.system.2025.103784",
language = "English",
volume = "133",
journal = "System",
issn = "0346-251X",
publisher = "Elsevier Ltd",

}

RIS

TY - JOUR

T1 - Automated scoring in the era of artificial intelligence

T2 - An empirical study with Turkish essays

AU - Aydın, Burak

AU - Kışla, Tarık

AU - Elmas, Nursel Tan

AU - Bulut, Okan

N1 - Publisher Copyright: © 2025 The Authors

PY - 2025/10

Y1 - 2025/10

N2 - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

AB - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

KW - Automated scoring

KW - Large language models

KW - Multilevel models

KW - Rater reliability

KW - Turkish essays

KW - Zero-shot with rubric

KW - Educational science

UR - http://www.scopus.com/inward/record.url?scp=105010969717&partnerID=8YFLogxK

U2 - 10.1016/j.system.2025.103784

DO - 10.1016/j.system.2025.103784

M3 - Journal articles

AN - SCOPUS:105010969717

VL - 133

JO - System

JF - System

SN - 0346-251X

M1 - 103784

ER -
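For readers unfamiliar with the agreement statistics cited in the abstract (Quadratic Weighted Kappa, Pearson correlation, and an overlap measure), the sketch below shows how such human-AI agreement metrics can be computed from two aligned vectors of rubric scores. This is an illustrative example, not the authors' code: the score vectors are hypothetical, and "overlap" is taken here as the proportion of exact matches, which may differ from the paper's definition.

```python
# Minimal sketch of human-AI agreement metrics, assuming two aligned vectors
# of integer rubric scores. All data below are hypothetical.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_scores = np.array([3, 4, 2, 5, 4, 3, 2, 4])  # hypothetical human ratings
ai_scores = np.array([3, 4, 3, 5, 4, 3, 2, 5])     # hypothetical AI ratings

# Quadratic Weighted Kappa: chance-corrected agreement with quadratic
# penalties for larger score discrepancies
qwk = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Pearson correlation between the two score vectors
r, _ = pearsonr(human_scores, ai_scores)

# "Overlap" treated here as exact-agreement rate (an assumption; the paper's
# overlap measure may instead count scores within one point, for example)
overlap = np.mean(human_scores == ai_scores)

print(f"QWK = {qwk:.2f}, Pearson r = {r:.2f}, overlap = {overlap:.1%}")
```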
