Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Publication: Contributions to collected editions > Articles in conference proceedings > Research > peer-reviewed

Authors

GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option without any explanation, and thus provide no insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average) and contain detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 of these 919 incorrect data points at a granular level for different classes and created multi-label spans to identify the reasons behind the errors. In our annotated dataset, annotators categorized a substantial portion of GPT-4's incorrect responses as a "Reasonable response by GPT-4." This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that this dataset will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.
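The dataset figures quoted in the abstract can be put in proportion with a few lines of arithmetic. The sketch below uses only the three counts stated above (4153 correct, 919 incorrect, 300 annotated). The function name `dataset_summary` and the dictionary keys are hypothetical and are not part of the released resources.

```python
# Counts as stated in the abstract; everything else is derived arithmetic.
CORRECT_RESPONSES = 4153
INCORRECT_RESPONSES = 919
ANNOTATED_INCORRECT = 300


def dataset_summary(correct: int, incorrect: int, annotated: int) -> dict:
    """Derive simple proportions from the reported counts (hypothetical helper)."""
    total = correct + incorrect
    return {
        "total_responses": total,
        # Share of all collected responses that are incorrect.
        "incorrect_share": incorrect / total,
        # Share of the incorrect responses that received granular annotation.
        "annotated_share_of_incorrect": annotated / incorrect,
    }


summary = dataset_summary(CORRECT_RESPONSES, INCORRECT_RESPONSES, ANNOTATED_INCORRECT)
print(summary["total_responses"])
print(f"{summary['incorrect_share']:.1%}")
print(f"{summary['annotated_share_of_incorrect']:.1%}")
```

Under these counts the dataset holds 5072 responses, of which about 18.1% are incorrect, and roughly a third (32.6%) of the incorrect responses were annotated. These shares describe the composition of G4UE as reported and need not match the benchmark accuracy quoted at the start of the abstract.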

Original language: English
Title: SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
Editors: Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, Yi Zhang
Number of pages: 10
Publisher: Association for Computing Machinery, Inc
Publication date: 11.07.2024
Pages: 1073-1082
ISBN (electronic): 9798400704314
Publication status: Published - 11.07.2024
Externally published: Yes
Event: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024 - Washington, USA / United States
Duration: 14.07.2024 to 18.07.2024

Bibliographical note

Publisher Copyright:
© 2024 Owner/Author.
