Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

Research output: Journal contributions > Scientific review articles > Research

Authors

Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot, rubric-based approach powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and by an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores, with a Quadratic Weighted Kappa of 0.72, a Pearson correlation of 0.73, and an overlap measure of 83.5%. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish and offer insights for broader implementation in under-represented languages, including possible sources of disagreement between human and AI scores. Because the conclusions rest on a specific writing task with a single human rater, future research should explore diverse inputs and multiple raters.
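For readers unfamiliar with the agreement statistics named in the abstract, the following is a minimal sketch of how a Quadratic Weighted Kappa and a Pearson correlation can be computed for two sets of integer scores. It is not the authors' code: the function names, the 1 to 5 score range, and the toy scores are assumptions made purely for illustration, and the abstract's "overlap measure" is not reproduced because its definition is not given here.

```python
from math import sqrt


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equal in length")
    if max_score <= min_score:
        raise ValueError("the score scale needs at least two categories")
    k = max_score - min_score + 1
    n = len(a)
    # Observed joint counts and the marginal histogram of each rater.
    observed = [[0] * k for _ in range(k)]
    hist_a = [0] * k
    hist_b = [0] * k
    for x, y in zip(a, b):
        i, j = x - min_score, y - min_score
        observed[i][j] += 1
        hist_a[i] += 1
        hist_b[j] += 1
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: grows with the squared distance between scores.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        # Both raters used one and the same category; treated as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy scores on an assumed 1-5 scale; these are NOT data from the study.
human_scores = [3, 4, 2, 5, 3, 4, 1, 2]
ai_scores = [3, 4, 3, 5, 2, 4, 2, 2]
print(quadratic_weighted_kappa(human_scores, ai_scores, 1, 5))
print(pearson(human_scores, ai_scores))
```

The quadratic weighting penalizes a two-point disagreement four times as heavily as a one-point disagreement, which is why this statistic is a common choice for ordinal rubric scores.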

Original language: English
Article number: 103784
Journal: System
Volume: 133
Number of pages: 12
ISSN: 0346-251X
DOIs
Publication status: Published - 10.2025

Bibliographical note

Publisher Copyright:
© 2025 The Authors

    Research areas

  • Automated scoring, Large language models, Multilevel models, Rater reliability, Turkish essays, Zero-shot with rubric
  • Educational science
