Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Prateek Chhikara
TL;DR
The paper tackles the miscalibration and overconfidence of large language models in factual QA. It introduces a distractor-augmented prompting framework to jointly assess accuracy and confidence across diverse models and datasets. Across nine LLMs and three QA benchmarks, structured distractors substantially improve calibration and accuracy, with nuanced effects depending on model size, tuning regime, and task type. The authors provide concrete recommendations on prompt design, fine-tuning strategies, and model selection, plus an evaluation framework to guide trustworthy LLM deployments in high-stakes settings.
Abstract
Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and Expected Calibration Error (ECE) reductions of up to 90%. Despite these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
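For readers unfamiliar with the metric, the following is a minimal sketch of how ECE is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between each bin's accuracy and mean confidence is averaged, weighted by bin size. The function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration; the abstract does not specify the paper's binning scheme, and the numbers below are not results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin_size / n) * |accuracy - mean confidence|.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer was actually right.
    n_bins:      number of equal-width bins (10 is a common default, assumed here).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy illustration (hypothetical values): a model that reports 90% confidence
# but is right only one time in four is heavily overconfident.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value indicates better calibration, so an "ECE reduction of up to 90%" means the weighted confidence-accuracy gap shrinks to as little as one tenth of its baseline value.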
