Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability

Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew Churpek, Majid Afshar

TL;DR

This study evaluates two large language models, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks to highlight the need for improved techniques in LLM confidence estimation.

Abstract

Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, which are vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examine three current methods of extracting probability estimates from LLMs and reveal their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.

Paper Structure

This paper contains 4 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Process map for generating a diagnosis, showing the role of LLMs in augmenting human diagnostic reasoning.
  • Figure 2: Area under the receiver operating characteristic curve (AUROC) scores from both LLMs using different EHR demographics input settings, across the diagnosis prediction tasks of Sepsis, Arrhythmia, and Congestive Heart Failure (CHF).
  • Figure 3: Pearson correlations between the LLM predicted pre-test probabilities and the baseline model (Raw Data+XGB) predicted probabilities.
  • Figure 4: Calibration curves for the four probability estimation methods, using Mistral-7B-Instruct and Llama3-70B-Chat-hf under the Default patient demographic setting.