
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton

Abstract

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
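The answer-or-abstain utility model described in the abstract can be sketched in code. The specific utility values below (+1 for a correct answer, a threshold-dependent cost of $-t/(1-t)$ for a wrong answer, 0 for abstaining, averaged over a grid of risk thresholds) are illustrative assumptions, not the paper's exact definition of BAS:

```python
def bas_sketch(confidences, correct, n_thresholds=99):
    """Illustrative sketch of a BAS-style score (assumed utilities, not the
    paper's exact formula).

    At risk threshold t, the decision-maker answers iff the reported
    confidence s >= t.  A correct answer earns +1, a wrong answer costs
    t / (1 - t) (the odds at which t is the break-even confidence), and
    abstaining earns 0.  Realized utility is averaged over a grid of
    thresholds and over examples.
    """
    thresholds = [(i + 1) / (n_thresholds + 1) for i in range(n_thresholds)]
    total = 0.0
    for s, ok in zip(confidences, correct):
        for t in thresholds:
            if s >= t:  # answer at this risk level
                total += 1.0 if ok else -t / (1.0 - t)
            # else: abstain, realized utility 0
    return total / (len(confidences) * len(thresholds))
```

Because the penalty for a wrong answer grows without bound as the threshold approaches 1, a score like this punishes highly overconfident errors far more than mild miscalibration, which is the asymmetry the abstract contrasts with log loss.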


Paper Structure

This paper contains 37 sections, 3 theorems, 24 equations, 7 figures, 6 tables.

Key Result

Theorem 2.1

Let $s \in [0, 1)$ be the reported model confidence and $p \in [0, 1]$ the true probability of correctness. The expected BAS utility is uniquely maximized when $s = p$ for all $p < 1$. For $p = 1$, the expected utility is strictly increasing in $s$ and attains its supremum as $s \to 1$.
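One plausible instantiation of the expected-utility argument (the threshold-dependent penalty $-t/(1-t)$ is our assumption here, not necessarily the paper's exact utility model): if the decision-maker answers at threshold $t$ iff $s \ge t$, earning $+1$ when correct and $-t/(1-t)$ when wrong, then aggregating over thresholds gives

```latex
\mathbb{E}[U(s)] = \int_0^s \left( p - (1-p)\,\frac{t}{1-t} \right) dt,
\qquad
\frac{d}{ds}\,\mathbb{E}[U(s)] = p - (1-p)\,\frac{s}{1-s},
```

and the derivative vanishes exactly at $s = p$ for $p < 1$; for $p = 1$ it is strictly positive for all $s \in [0,1)$, so the expected utility increases toward its supremum as $s \to 1$, matching the statement of the theorem.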

Figures (7)

  • Figure 1: We introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric derived from an explicit answer-or-abstain utility model, and use it to evaluate LLM confidence reliability across a diverse set of models and tasks. Reliability varies substantially across tasks, and while larger and more accurate models tend to also achieve higher BAS, even frontier models remain prone to overconfidence on complex, open-ended tasks.
  • Figure 2: Relationship between model scale, predictive performance, and confidence reliability. Top: Model size vs. accuracy and reliability metrics. Larger models tend to achieve higher accuracy and improved confidence reliability (higher BAS, lower ECE and AURC), although substantial variation remains across models. Bottom: Accuracy vs. reliability metrics. Models with higher accuracy tend to also exhibit more reliable confidence estimates.
  • Figure A1: Relationship between BAS and the standard reliability metrics (ECE, AURC) across all model-dataset pairs. While BAS is correlated with ECE and AURC, several points deviate substantially from the trend, indicating that models with similar confidence calibration or ranking can exhibit highly different decision-level reliability.
  • Figure A2: Distribution of predicted confidence values for Llama 3.3 and Mistral (M) on AIME and SimpleQA, separated by correctness. While both models exhibit high-confidence predictions, Llama 3.3 assigns extremely high confidence to a larger number of incorrect answers. These rare but highly overconfident errors are strongly penalized by BAS but contribute only modestly to ECE, explaining the observed discrepancies between the metrics.
  • Figure A3: Relationship between BAS and the proper scoring rule log loss across all model-dataset pairs. The two metrics are highly correlated in practice, reflecting their shared sensitivity to high-confidence errors and the clear tendency of modern LLMs to exhibit overconfidence rather than underconfidence. However, BAS differs structurally by imposing an asymmetric penalty that strongly prioritizes avoiding overconfident errors, enabling it to distinguish between models that trade off overconfidence with underconfidence.
  • ...and 2 more figures

Theorems & Definitions (5)

  • Theorem 2.1: Optimality of BAS Utility
  • Theorem A.1: Optimality of BAS Utility
  • Proof
  • Theorem B.1: Optimality under Weighted Risk
  • Proof