Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. / Aydın, Burak; Kışla, Tarık; Elmas, Nursel Tan et al.
In: System, Vol. 133, 103784, 10.2025.



Vancouver

Aydın B, Kışla T, Elmas NT, Bulut O. Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. System. 2025 Oct;133:103784. doi: 10.1016/j.system.2025.103784

Bibtex

@article{f284135ec3a8446cbe6e35f049a4a7f7,
title = "Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays",
abstract = "Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.",
keywords = "Automated scoring, Large language models, Multilevel models, Rater reliability, Turkish essays, Zero-shot with rubric, Educational science",
author = "Burak Aydın and Tarık Kı{\c s}la and Elmas, {Nursel Tan} and Okan Bulut",
note = "Publisher Copyright: {\textcopyright} 2025 The Authors",
year = "2025",
month = oct,
doi = "10.1016/j.system.2025.103784",
language = "English",
volume = "133",
journal = "System",
issn = "0346-251X",
publisher = "Elsevier Ltd",

}
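The abstract above describes a zero-shot, rubric-based setup in which GPT-4o scores each essay on six CEFR-grounded dimensions through a custom-built interface. The paper's interface, prompt, and rubric are not reproduced here; the sketch below only illustrates how such zero-shot rubric scoring could be wired up with the OpenAI Python SDK. The dimension names, scale, and rubric wording are hypothetical placeholders, not the instrument used in the study.

# Minimal sketch of zero-shot, rubric-based essay scoring with GPT-4o.
# The rubric text, dimension names, and 0-5 scale are illustrative placeholders,
# not the CEFR-grounded rubric actually used in the study.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the Turkish essay on six dimensions (0-5 each): content, organization, "
    "vocabulary, grammar, coherence, task_achievement. "
    "Return a JSON object with one integer per dimension."
)

def score_essay(essay_text: str) -> dict:
    """Ask GPT-4o for rubric scores without any scored examples (zero-shot)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep scoring as deterministic as the API allows
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are an experienced rater of Turkish L2 writing."},
            {"role": "user", "content": f"{RUBRIC}\n\nEssay:\n{essay_text}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

With JSON-mode output and temperature 0, the per-dimension scores come back in machine-readable form, which is the kind of output a scoring interface would need before comparing against human ratings.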

RIS

TY - JOUR

T1 - Automated scoring in the era of artificial intelligence

T2 - An empirical study with Turkish essays

AU - Aydın, Burak

AU - Kışla, Tarık

AU - Elmas, Nursel Tan

AU - Bulut, Okan

N1 - Publisher Copyright: © 2025 The Authors

PY - 2025/10

Y1 - 2025/10

N2 - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

AB - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

KW - Automated scoring

KW - Large language models

KW - Multilevel models

KW - Rater reliability

KW - Turkish essays

KW - Zero-shot with rubric

KW - Educational science

UR - http://www.scopus.com/inward/record.url?scp=105010969717&partnerID=8YFLogxK

U2 - 10.1016/j.system.2025.103784

DO - 10.1016/j.system.2025.103784

M3 - Journal articles

AN - SCOPUS:105010969717

VL - 133

JO - System

JF - System

SN - 0346-251X

M1 - 103784

ER -
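The agreement figures reported in the abstract (Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, overlap of 83.5 %) are standard metrics for comparing two sets of ratings. As a rough illustration, assuming a vector of human scores and a vector of AI scores on the same integer scale, they could be computed as below; the score arrays are placeholders rather than the study's data, and overlap is computed here as exact agreement, which may differ from the paper's exact definition (e.g., exact-plus-adjacent agreement).

# Illustrative computation of the agreement metrics named in the abstract.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 4, 2, 5, 3, 4, 2, 4])  # placeholder human-rater scores
ai    = np.array([3, 4, 3, 5, 3, 5, 2, 4])  # placeholder AI scores

qwk = cohen_kappa_score(human, ai, weights="quadratic")  # Quadratic Weighted Kappa
r, _ = pearsonr(human, ai)                               # Pearson correlation
overlap = 100 * np.mean(human == ai)                     # exact-agreement overlap (%)

print(f"QWK = {qwk:.2f}, Pearson r = {r:.2f}, overlap = {overlap:.1f}%")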
