Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items
Publication: Contributions to journals › Journal articles › Research › peer-reviewed
in: Psychometrika, 2025.
RIS
TY - JOUR
T1 - Joint Item Response Models for Manual and Automatic Scores on Open-Ended Test Items
AU - Bengs, Daniel
AU - Brefeld, Ulf
AU - Kroehne, Ulf
AU - Zehner, Fabian
N1 - Publisher Copyright: © 2025 Cambridge University Press. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Test items using open-ended response formats can increase an instrument’s construct validity. However, traditionally, their application in educational testing requires human coders to score the responses. Manual scoring not only increases operational costs but also prohibits the use of evidence from open-ended items to inform routing decisions in adaptive designs. Using machine learning and natural language processing, automatic scoring provides classifiers that can instantly assign scores to text responses. Although optimized for agreement with manual scores, automatic scoring is not perfectly accurate and introduces an additional source of error into the response process, leading to a misspecification of the measurement model used with the manual score. We propose two joint models for manual and automatic scores of automatically scored open-ended items. Our models extend a given Item Response Theory model for the manual scores with a component for the automatic scores that accounts for classification errors. The models were evaluated using data from the Programme for International Student Assessment (2012) and simulated data, demonstrating their capacity to mitigate the impact of classification errors on ability estimation compared to a baseline that disregards classification errors.
KW - automatic scoring
KW - item response modeling
KW - large-scale assessment
KW - Informatics
UR - http://www.scopus.com/inward/record.url?scp=105008562418&partnerID=8YFLogxK
U2 - 10.1017/psy.2025.10018
DO - 10.1017/psy.2025.10018
M3 - Journal articles
C2 - 40518623
AN - SCOPUS:105008562418
JO - Psychometrika
JF - Psychometrika
SN - 0033-3123
ER -
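The abstract describes the modeling idea only at a high level: a given item response model for the manual score is extended with a component for the automatic score that accounts for classification errors. The LaTeX sketch below shows one minimal way to write such a joint model; it is an illustration under assumptions introduced here (a Rasch parameterization and item-level error rates \alpha_i, \beta_i), not the authors' actual specification.

% Minimal illustrative sketch, not the paper's parameterization.
% Manual score X_{ij} of person j on open-ended item i, here under a Rasch model
% with ability \theta_j and item difficulty b_i:
P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

% Automatic score Y_{ij} from the classifier, linked to the manual score through
% assumed item-level error rates (false-positive rate \alpha_i, false-negative rate \beta_i):
P(Y_{ij} = 1 \mid X_{ij} = 1) = 1 - \beta_i, \qquad P(Y_{ij} = 1 \mid X_{ij} = 0) = \alpha_i

% Joint probability of both observed scores, assuming the automatic score depends on
% ability only through the manual score:
P(X_{ij} = x, Y_{ij} = y \mid \theta_j) = P(X_{ij} = x \mid \theta_j)\, P(Y_{ij} = y \mid X_{ij} = x)

Under this sketch, if only the automatic score is available at test time (e.g., to inform routing in an adaptive design), the corresponding measurement model follows by marginalizing over the unobserved manual score: P(Y_{ij} = 1 \mid \theta_j) = (1 - \beta_i)\, P(X_{ij} = 1 \mid \theta_j) + \alpha_i\, P(X_{ij} = 0 \mid \theta_j).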