Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Publication: Contributions to edited volumes › Articles in conference proceedings › Research › peer-reviewed

Standard

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. / Roy, Soumyadeep; Khatua, Aparup; Ghoochani, Fatemeh et al.
SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. ed. / Grace Hui Yang; Hongning Wang; Sam Han; Claudia Hauff; Guido Zuccon; Yi Zhang. Association for Computing Machinery, Inc, 2024. pp. 1073-1082 (SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval).


Harvard

Roy, S, Khatua, A, Ghoochani, F, Hadler, U, Nejdl, W & Ganguly, N 2024, Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. in G Hui Yang, H Wang, S Han, C Hauff, G Zuccon & Y Zhang (eds), SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, Inc, pp. 1073-1082, 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington, United States, 14.07.24. https://doi.org/10.1145/3626772.3657882

APA

Roy, S., Khatua, A., Ghoochani, F., Hadler, U., Nejdl, W., & Ganguly, N. (2024). Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. In G. Hui Yang, H. Wang, S. Han, C. Hauff, G. Zuccon, & Y. Zhang (Eds.), SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1073-1082). (SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval). Association for Computing Machinery, Inc. https://doi.org/10.1145/3626772.3657882

Vancouver

Roy S, Khatua A, Ghoochani F, Hadler U, Nejdl W, Ganguly N. Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions. In Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y, editors, SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, Inc. 2024. p. 1073-1082. (SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval). doi: 10.1145/3626772.3657882

Bibtex

@inproceedings{43ac67e85b534e2092a41a4505500af7,
title = "Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions",
abstract = "GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of errors remain. Additionally, current works use GPT-4 to only predict the correct option without providing any explanation and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 GPT-4 correct responses and 919 incorrect responses to the United States Medical Licensing Examination (USMLE) respectively. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized as a {"}Reasonable response by GPT-4,{"}by annotators. This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.",
keywords = "gpt-4, medical qa, multi-label dataset, usmle error taxonomy, Informatics, Business informatics",
author = "Soumyadeep Roy and Aparup Khatua and Fatemeh Ghoochani and Uwe Hadler and Wolfgang Nejdl and Niloy Ganguly",
note = "Publisher Copyright: {\textcopyright} 2024 Owner/Author.; 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024 ; Conference date: 14-07-2024 Through 18-07-2024",
year = "2024",
month = jul,
day = "11",
doi = "10.1145/3626772.3657882",
language = "English",
series = "SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval",
publisher = "Association for Computing Machinery, Inc",
pages = "1073--1082",
editor = "{Hui Yang}, Grace and Hongning Wang and Sam Han and Claudia Hauff and Guido Zuccon and Yi Zhang",
booktitle = "SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval",
address = "United States",

}
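
The abstract above quotes concrete dataset statistics: 4153 correct and 919 incorrect GPT-4 responses, averaging 258 words each. As an illustration of how one might load and sanity-check such a dataset, here is a minimal Python sketch. The file name g4ue_responses.csv and the columns response and is_correct are hypothetical placeholders, not the repository's documented schema; consult https://github.com/roysoumya/usmle-gpt4-error-taxonomy for the actual files.

# Minimal sketch: sanity-check the G4UE dataset statistics quoted in the abstract.
# NOTE: "g4ue_responses.csv", "response", and "is_correct" are assumed names for
# illustration only; the actual files live in the GitHub repository linked above.
import pandas as pd

df = pd.read_csv("g4ue_responses.csv")

# The abstract reports 4153 correct and 919 incorrect GPT-4 responses.
mask = df["is_correct"].astype(bool)
print(f"correct: {mask.sum()}, incorrect: {(~mask).sum()}")

# The abstract reports an average response length of 258 words;
# whitespace tokenization gives a comparable approximation.
avg_words = df["response"].str.split().str.len().mean()
print(f"average response length: {avg_words:.0f} words")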

RIS

TY - CONF

T1 - Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

T2 - 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024

AU - Roy, Soumyadeep

AU - Khatua, Aparup

AU - Ghoochani, Fatemeh

AU - Hadler, Uwe

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Publisher Copyright: © 2024 Owner/Author.

PY - 2024/7/11

Y1 - 2024/7/11

N2 - GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option, without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to the United States Medical Licensing Examination (USMLE). These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level for different classes and created multi-label spans to identify the reasons behind the errors. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4". This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe this will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.

AB - GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option, without providing any explanation, and thus offer no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to the United States Medical Licensing Examination (USMLE). These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level for different classes and created multi-label spans to identify the reasons behind the errors. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized by annotators as a "Reasonable response by GPT-4". This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe this will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.

KW - gpt-4

KW - medical qa

KW - multi-label dataset

KW - usmle error taxonomy

KW - Informatics

KW - Business informatics

UR - http://www.scopus.com/inward/record.url?scp=85199188807&partnerID=8YFLogxK

U2 - 10.1145/3626772.3657882

DO - 10.1145/3626772.3657882

M3 - Article in conference proceedings

AN - SCOPUS:85199188807

T3 - SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

SP - 1073

EP - 1082

BT - SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

A2 - Hui Yang, Grace

A2 - Wang, Hongning

A2 - Han, Sam

A2 - Hauff, Claudia

A2 - Zuccon, Guido

A2 - Zhang, Yi

PB - Association for Computing Machinery, Inc

Y2 - 14 July 2024 through 18 July 2024

ER -
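
The RIS record above follows the standard tag format: a two-character tag, a hyphen separator, and a value, with repeatable tags such as AU, A2, and KW appearing once per value and ER marking the end of the record. A minimal Python sketch of parsing such a record into a dictionary of lists follows; it accepts both the single-space "TAG - value" layout shown above and the canonical two-space layout.

# Minimal RIS parser sketch: maps each two-character tag (TY, AU, KW, ...) to the
# list of values it carries; "ER" terminates the record.
import re
from collections import defaultdict

TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$")

def parse_ris(text: str) -> dict:
    record = defaultdict(list)
    for line in text.splitlines():
        m = TAG_LINE.match(line)
        if not m:
            continue  # skip blank lines between tags
        tag, value = m.group(1), m.group(2).strip()
        if tag == "ER":
            break  # end of record
        if value:
            record[tag].append(value)
    return dict(record)

# Example on a fragment of the record above:
sample = """TY - CONF
AU - Roy, Soumyadeep
AU - Khatua, Aparup
SP - 1073
EP - 1082
ER -
"""
rec = parse_ris(sample)
print(rec["AU"])                   # ['Roy, Soumyadeep', 'Khatua, Aparup']
print(rec["SP"][0], rec["EP"][0])  # 1073 1082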
