Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items
Publication: Contributions to journals › Journal articles › Research › peer-reviewed
in: Psychometrika, 2025.
RIS
TY - JOUR
T1 - Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items
AU - Bengs, Daniel
AU - Brefeld, Ulf
AU - Kroehne, Ulf
AU - Zehner, Fabian
N1 - Publisher Copyright: © 2025 Cambridge University Press. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Test items using open-ended response formats can increase an instrument’s construct validity. However, traditionally, their application in educational testing requires human coders to score the responses. Manual scoring not only increases operational costs but also prohibits the use of evidence from open-ended items to inform routing decisions in adaptive designs. Using machine learning and natural language processing, automatic scoring provides classifiers that can instantly assign scores to text responses. Although optimized for agreement with manual scores, automatic scoring is not perfectly accurate and introduces an additional source of error into the response process, leading to a misspecification of the measurement model used with the manual score. We propose two joint models for manual and automatic scores of automatically scored open-ended items. Our models extend a given Item Response Theory model for the manual scores with a component for the automatic scores that accounts for classification errors. The models were evaluated using data from the Programme for International Student Assessment (2012) and simulated data, demonstrating their capacity to mitigate the impact of classification errors on ability estimation compared to a baseline that disregards classification errors.
KW - automatic scoring
KW - item response modeling
KW - large-scale assessment
KW - Informatics
UR - http://www.scopus.com/inward/record.url?scp=105008562418&partnerID=8YFLogxK
U2 - 10.1017/psy.2025.10018
DO - 10.1017/psy.2025.10018
M3 - Journal articles
C2 - 40518623
AN - SCOPUS:105008562418
JO - Psychometrika
JF - Psychometrika
SN - 0033-3123
ER -
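The abstract describes the modeling idea only at a high level: a given item response model for the manual score is extended with a component for the automatic score that accounts for classification errors. The LaTeX sketch below shows one minimal way to write such a joint model; it is an illustration under assumptions introduced here (a Rasch parameterization and item-level error rates \alpha_i, \beta_i), not the authors' actual specification.

% Minimal illustrative sketch, not the paper's parameterization.
% Manual score X_{ij} of person j on open-ended item i, here under a Rasch model
% with ability \theta_j and item difficulty b_i:
P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

% Automatic score Y_{ij} from the classifier, linked to the manual score through
% assumed item-level error rates (false-positive rate \alpha_i, false-negative rate \beta_i):
P(Y_{ij} = 1 \mid X_{ij} = 1) = 1 - \beta_i, \qquad P(Y_{ij} = 1 \mid X_{ij} = 0) = \alpha_i

% Joint probability of both observed scores, assuming the automatic score depends on
% ability only through the manual score:
P(X_{ij} = x, Y_{ij} = y \mid \theta_j) = P(X_{ij} = x \mid \theta_j)\, P(Y_{ij} = y \mid X_{ij} = x)

Under this sketch, if only the automatic score is available at test time (e.g., to inform routing in an adaptive design), the corresponding measurement model follows by marginalizing over the unobserved manual score: P(Y_{ij} = 1 \mid \theta_j) = (1 - \beta_i)\, P(X_{ij} = 1 \mid \theta_j) + \alpha_i\, P(X_{ij} = 0 \mid \theta_j).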