GPT-4's assessment of its performance in a USMLE-based case study

Uttam Dhakal; Aniket Kumar Singh; Suman Devkota; Yogesh Sapkota; Bishal Lamichhane; Suprinsa Paudyal; Chandra Dhakal

TL;DR

This work investigates GPT-4's ability to self-assess confidence when answering USMLE-style questions and how feedback affects calibration. Using a simple prompting strategy, the model reports Absolute and Relative Confidence before and after each item under With Feedback and No Feedback settings across 100 questions (plus 16 high-school items). The findings show consistently high confidence (around 0.9) with no clear, consistent improvement in accuracy due to feedback, though confidence dynamics vary across questions. The study highlights the importance of confidence calibration for AI in healthcare and suggests that feedback mechanisms may need careful design to avoid over- or under-confidence in critical medical contexts.

Abstract

This study investigates how GPT-4 assesses its own performance in healthcare applications. Using a simple prompting technique, the LLM was given questions drawn from the United States Medical Licensing Examination (USMLE) and asked to report a confidence score both before each question was posed and after it was answered. The questions were divided into two groups: those followed by feedback (WF) and those with no feedback (NF) after the question. In both groups, the model provided absolute and relative confidence scores before and after each question. The results were analyzed statistically to study the variability of confidence in the WF and NF groups, and a sequential analysis was conducted to observe how performance varied over the course of each group. The results indicate that feedback influences relative confidence but does not consistently increase or decrease it. Understanding how LLMs perform is essential to exploring their utility in sensitive areas such as healthcare. This study contributes to the ongoing discourse on the reliability of AI, particularly of LLMs like GPT-4, within healthcare, and offers insights into how feedback mechanisms might be optimized to enhance AI-assisted medical education and decision support.
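To make the notion of calibration concrete, the gap between reported confidence and observed accuracy in each group can be computed as in the minimal sketch below. This is an illustration only, not the authors' analysis code: the function name `calibration_gap`, the record layout of `(confidence, correct)` pairs, and the toy values are all assumptions introduced here, and the numbers are placeholders rather than results from the study.

```python
from statistics import mean


def calibration_gap(records):
    """Summarize a list of (confidence, correct) pairs.

    Returns (mean confidence, accuracy, mean confidence - accuracy).
    A positive gap suggests overconfidence; a negative gap, underconfidence.
    """
    avg_conf = mean(conf for conf, _ in records)
    accuracy = mean(1.0 if correct else 0.0 for _, correct in records)
    return avg_conf, accuracy, avg_conf - accuracy


# Hypothetical placeholder data, NOT taken from the study.
# Each pair is (post-question confidence in [0, 1], answered correctly?).
toy = {
    "WF": [(0.9, True), (0.9, False), (0.8, True), (1.0, True)],
    "NF": [(0.9, True), (0.9, True), (0.9, False), (0.9, False)],
}

# One summary tuple per group, keyed by the group label.
gaps = {group: calibration_gap(rows) for group, rows in toy.items()}
```

Comparing the third element of each tuple across the WF and NF groups is one simple way to check whether feedback moves confidence closer to actual accuracy.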
