Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models

Prateek Chhikara

TL;DR

The paper tackles the miscalibration and overconfidence of large language models in factual QA. It introduces a distractor-augmented prompting framework to jointly assess accuracy and confidence across diverse models and datasets. Across nine LLMs and three QA benchmarks, structured distractors substantially improve calibration and accuracy, with nuanced effects depending on model size, tuning regime, and task type. The authors provide concrete recommendations on prompt design, fine-tuning strategies, and model selection, plus an evaluation framework to guide trustworthy LLM deployments in high-stakes settings.

Abstract

Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and Expected Calibration Error (ECE) reductions of up to 90%. Despite these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
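
To make the two headline metrics concrete, here is a minimal sketch of how ECE and a relative improvement could be computed. It assumes the common equal-width binning formulation of ECE with 10 bins; the exact binning used in the paper is not stated in this excerpt. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for s_conf, s_hit, k in zip(conf_sum, hit_sum, count):
        if k == 0:
            continue  # empty bins contribute nothing
        ece += (k / n) * abs(s_hit / k - s_conf / k)
    return ece


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, e.g. 1.5 means +150%."""
    if before == 0:
        raise ValueError("relative change is undefined for before == 0")
    return (after - before) / before


# Illustrative values only: four answers at 90% confidence, one of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, False, False, False])
accuracy_gain = relative_change(0.20, 0.50)
```

In this toy case the model claims 90% confidence but is right 25% of the time, so the ECE is the gap between the two. A relative accuracy improvement is measured against the baseline accuracy, which is why a low baseline can produce very large percentages.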

Paper Structure (34 sections, 4 equations, 7 figures, 4 tables)

This paper contains 34 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An instance from SimpleQA dataset where an LLM assigns high confidence to an incorrect answer.
  • Figure 2: Reliability diagrams (RDs) showing calibration performance in $\mathcal{N}$ ($\bullet$) and $\mathcal{D}$ ($\bullet$) settings on the SimpleQA dataset. (y-axis: actual accuracy, x-axis: predicted confidence)
  • Figure 3: Accuracy and calibration shifts with distractors. We show relative accuracy gains (bars) and ECE changes (points) when distractor options are added. While all models improve in accuracy, calibration effects vary: large models benefit most, while smaller models often remain miscalibrated.
  • Figure 4: Performance (correct) of LLMs across different question types in both $\mathcal{N}$ ($\bullet$) and $\mathcal{D}$ ($\bullet$) settings.
  • Figure 5: Reliability diagrams (RDs) on the SimpleQA dataset showing calibration performance in $\mathcal{N}$ ($\bullet$) and $\mathcal{D}$ ($\bullet$) settings. The numbers on top of the bars represent the number of correctly predicted instances (y-axis: actual accuracy, x-axis: predicted confidence). Here the LLM judge model is the same as the prediction model.
  • ...and 2 more figures
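
The reliability diagrams described in the captions plot actual accuracy against predicted confidence, with the count of correct predictions shown per bar. A rough sketch of how the per-bin data behind such a diagram could be tabulated follows. The bin count, function name, and sample inputs are assumptions for illustration and do not reproduce the paper's plotting code or data.

```python
from typing import Optional, Sequence, TypedDict


class Bin(TypedDict):
    lo: float                   # lower edge of the confidence bin
    hi: float                   # upper edge of the confidence bin
    n: int                      # predictions falling in the bin
    n_correct: int              # correct predictions in the bin
    accuracy: Optional[float]   # None when the bin is empty


def reliability_bins(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,  # assumption: equal-width bins over [0, 1]
) -> list[Bin]:
    """Group predictions by confidence and compute accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins: list[Bin] = [
        {"lo": i / n_bins, "hi": (i + 1) / n_bins,
         "n": 0, "n_correct": 0, "accuracy": None}
        for i in range(n_bins)
    ]
    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(c * n_bins), n_bins - 1)  # 1.0 goes in the last bin
        bins[idx]["n"] += 1
        bins[idx]["n_correct"] += int(ok)

    for b in bins:
        if b["n"] > 0:
            b["accuracy"] = b["n_correct"] / b["n"]
    return bins


# Illustrative inputs only.
table = reliability_bins([0.1, 0.95, 0.95, 0.5], [False, True, False, True])
```

For a well-calibrated model, each bin's accuracy sits near the midpoint of its confidence range; bars falling below that diagonal indicate overconfidence.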