Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

Publication: Contributions to journals > Journal articles > Research > Peer-reviewed

Standard

Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. / Aydın, Burak; Kışla, Tarık; Elmas, Nursel Tan et al.
In: System, Vol. 133, 103784, 10.2025.

Vancouver

Aydın B, Kışla T, Elmas NT, Bulut O. Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays. System. 2025 Oct;133:103784. doi: 10.1016/j.system.2025.103784

Bibtex

@article{f284135ec3a8446cbe6e35f049a4a7f7,
title = "Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays",
abstract = "Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.",
keywords = "Automated scoring, Large language models, Multilevel models, Rater reliability, Turkish essays, Zero-shot with rubric, Educational science",
author = "Burak Aydın and Tarık Kı{\c s}la and Elmas, {Nursel Tan} and Okan Bulut",
note = "Publisher Copyright: {\textcopyright} 2025 The Authors",
year = "2025",
month = oct,
doi = "10.1016/j.system.2025.103784",
language = "English",
volume = "133",
journal = "System",
issn = "0346-251X",
publisher = "Elsevier Ltd",

}

RIS

TY - JOUR

T1 - Automated scoring in the era of artificial intelligence

T2 - An empirical study with Turkish essays

AU - Aydın, Burak

AU - Kışla, Tarık

AU - Elmas, Nursel Tan

AU - Bulut, Okan

N1 - Publisher Copyright: © 2025 The Authors

PY - 2025/10

Y1 - 2025/10

AB - Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.

KW - Automated scoring

KW - Large language models

KW - Multilevel models

KW - Rater reliability

KW - Turkish essays

KW - Zero-shot with rubric

KW - Educational science

UR - http://www.scopus.com/inward/record.url?scp=105010969717&partnerID=8YFLogxK

U2 - 10.1016/j.system.2025.103784

DO - 10.1016/j.system.2025.103784

M3 - Journal articles

AN - SCOPUS:105010969717

VL - 133

JO - System

JF - System

SN - 0346-251X

M1 - 103784

ER -
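
The abstract reports human-AI agreement as a Quadratic Weighted Kappa (0.72) and a Pearson correlation (0.73). As a rough illustration of what these two statistics measure, the following is a minimal sketch using the standard textbook definitions. It is not the authors' code, the function names are hypothetical, and the scores in any call are made-up placeholders rather than data from the study. The overlap measure is left out because the abstract does not define it.

```python
from collections import Counter
from math import sqrt


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Hypothetical helper for illustration; assumes every score lies in
    [min_score, max_score].
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equally long")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(a)
    observed = Counter(zip(a, b))  # joint counts of (rater A, rater B) scores
    hist_a, hist_b = Counter(a), Counter(b)  # marginal counts per rater
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic weight: larger disagreements are penalised more.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += w * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected


def pearson(a, b):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)
```

Calling `quadratic_weighted_kappa(human, ai, 1, 4)` on two equally long lists of 1 to 4 ratings yields 1.0 for perfect agreement and values near or below 0 when agreement is no better than chance. The score range is passed explicitly so that categories neither rater happened to use still count towards the weights.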
