
Mind the Gap: Benchmarking LLM Uncertainty and Calibration with Specialty-Aware Clinical QA and Reasoning-Based Behavioural Features

Alberto Testoni, Iacer Calixto

TL;DR

The paper benchmarks uncertainty quantification for clinical QA by evaluating 10 open-source LLMs and representative proprietary models across 11 medical specialties and 6 question types. Confidence estimates based on $p_\theta(y \mid x)$ are assessed for discrimination (AUROC) and calibration (ECE and Brier score). It compares score-based, consistency-based, and conformal-set methods, and introduces a lightweight behavioral-feature approach derived from reasoning traces as a single-pass proxy for uncertainty. Findings reveal substantial heterogeneity: uncertainty reliability depends on specialty and question type, larger models do not universally improve calibration, and a simple regression on behavioral features can approximate sampling-based methods. The study advocates context-aware evaluation and suggests ensemble or task-specific model use, providing data, code, and a roadmap toward safer deployment of LLMs in healthcare.
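As a rough illustration of the three metrics named above, the sketch below computes AUROC, ECE, and the Brier score from per-question confidence values and correctness labels. It is a minimal standard-library sketch and not the paper's evaluation code: the function names, the equal-width binning with `n_bins=10`, and the toy inputs are all assumptions made for illustration.

```python
from typing import List, Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as one half)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs (hypothetical): confidence per question and whether the answer was right.
confidences = [0.9, 0.8, 0.3, 0.2]
is_correct = [1, 1, 0, 0]
print(auroc(confidences, is_correct))
print(expected_calibration_error(confidences, is_correct))
print(brier_score(confidences, is_correct))
```

The two kinds of metric can disagree. AUROC only depends on how the confidences rank correct against incorrect answers, whereas ECE and the Brier score also penalize confidences that are systematically too high or too low. This is why the summary reports discrimination and calibration separately.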

Abstract

Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA focusing, for the first time, on eleven clinical specialties and six question types, and across ten open-source LLMs (general-purpose, biomedical, and reasoning models), alongside representative proprietary models. We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type due to shifts in calibration and discrimination. Our results highlight the need to select or ensemble models based on their distinct, complementary strengths and clinical use.
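To make the set-based idea concrete, the following is a minimal sketch of split conformal prediction for multiple-choice QA. It assumes the model exposes a probability for each answer option, and it is not necessarily the configuration used in the paper. The nonconformity score (one minus the probability of the true option), the function names, and the toy calibration data are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(
    cal_probs: Sequence[Dict[str, float]],
    cal_labels: Sequence[str],
    alpha: float = 0.1,
) -> float:
    """Calibrate a score threshold targeting roughly (1 - alpha) coverage.

    The score for each calibration question is 1 - p(true option).
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points: every option is kept
    return scores[k - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """Keep every option whose score 1 - p falls at or below the threshold."""
    return [option for option, p in probs.items() if 1.0 - p <= qhat]


# Hypothetical calibration split: nine questions whose true option is "A".
true_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
cal_probs = [{"A": p, "B": 1.0 - p} for p in true_probs]
cal_labels = ["A"] * len(true_probs)

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(prediction_set({"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}, qhat))
```

Instead of a single confidence value, the output is a set of answer options. Larger sets signal higher uncertainty, which is what makes this approach complementary to score-based methods.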

Paper Structure

This paper contains 49 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Uncertainty estimation for clinical QA across LLMs featuring AUROC (discrimination) vs. ECE (calibration). Marker size indicates model size; color and shape indicate model type (general, reasoning, biomedical). The green-shaded area (top left) highlights the desirable region: low calibration error and high discrimination. Most single-pass methods fail on both; even G-NLL (Aichberger et al., 2024) underperforms despite good performance reported on non-clinical QA. Semantic Entropy achieves strong performance, though it requires multiple generations.
  • Figure 2: Discrimination ($1 - \mathrm{AUROC}$) and calibration (average ECE and Brier score) of semantic entropy–based uncertainty estimates across specialties. $\star$ indicates the most accurate model per specialty.
  • Figure 3: Calibration/discrimination per question type using Semantic Entropy estimates.
  • Figure 4: Our proposed method, based on behavioral features in reasoning-oriented LLMs (shown in bold), performs strongly across AUROC, ECE, and Brier Score, approaching Semantic Entropy, which instead relies on multiple generations via sampling.
  • Figure 5: Our proposed methods, based on behavioral features (shown in bold), offer strong performance across AUROC, ECE, and Brier Score, approaching that of Semantic Entropy, which instead relies on multiple generations via sampling.
  • ...and 2 more figures