Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration

Jeremy Qin, Bang Liu, Quoc Dinh Nguyen

TL;DR

This work proposes Atypical Presentations Recalibration, a method that uses atypical presentations to adjust a model's confidence estimates. It reduces calibration errors by approximately 60% on three medical question answering datasets and outperforms existing methods such as vanilla verbalized confidence and CoT verbalized confidence, among others.

Abstract

Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
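To make the notion of calibration error concrete, the following is a minimal sketch of the standard expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of the samples. The function name, the equal-width binning, and the default of 10 bins are assumptions for illustration; the paper's exact evaluation setup may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0.0] * n_bins   # number of correct answers per bin
    count = [0] * n_bins       # number of samples per bin

    for c, y in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += y
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy example (made-up values): the model states 90% confidence
# on four answers but only two of them are correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

In this toy case all four answers fall into one bin with mean confidence 0.9 and accuracy 0.5, so the ECE is the gap between the two. A well-calibrated model would have per-bin accuracy close to its stated confidence and an ECE near zero.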

Paper Structure

This paper contains 31 sections, 5 equations, 26 figures, and 3 tables.

Figures (26)

  • Figure 1: A physician diagnoses a patient who returned from a camping trip, presenting a combination of common symptoms and signs like fever and headaches. However, by recognizing the rash as an atypical symptom, the physician ultimately identifies the condition as an allergy.
  • Figure 2: Calibration Curves of the different methods for GPT-3.5-turbo
  • Figure 3: ECE of GPT-3.5-turbo for each method across all three datasets.
  • Figure 4: Accuracy by typicality bins of GPT-3.5-turbo for Atypical Presentations Aware Recalibration methods.
  • Figure 5: Distribution of atypicality scores between Atypical Presentations and Atypical Scenario of GPT-3.5-turbo on MedQA.
  • ...and 21 more figures