Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

Yuyang Dai

TL;DR

Across six LLMs and three datasets, a 0--20 confidence scale consistently improves metacognitive efficiency over the standard 0--100 format. Boundary compression degrades performance, and round-number preferences persist even under irregular ranges.

Abstract

Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d'. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
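The discretization statistics mentioned above (share of responses on the most frequent values, number of distinct values used) can be illustrated with a minimal sketch. The function name `discretization_profile` and the sample reports are hypothetical, chosen for illustration only; they are not the paper's code or data.

```python
from collections import Counter


def discretization_profile(confidences, top_k=3):
    """Summarize how concentrated a list of integer confidence reports is.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter(confidences)
    total = len(confidences)
    top = counts.most_common(top_k)  # [(value, count), ...] by frequency
    return {
        "distinct_values": len(counts),
        "top1_share": top[0][1] / total,
        "topk_share": sum(count for _, count in top) / total,
        "topk_values": [value for value, _ in top],
    }


# Made-up reports on a 0-100 scale (illustrative only).
reports = [90] * 4 + [95] * 3 + [80] * 2 + [73] * 1
profile = discretization_profile(reports)
print(profile)
```

In this toy sample, three values cover nine of the ten reports, which is the kind of concentration the abstract describes for real model outputs.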

Paper Structure (55 sections, 2 equations, 5 figures, 17 tables)

Introduction
Related Work
Verbalized Confidence and Calibration in LLMs
Scale Design in Psychometrics
Metacognitive Sensitivity and Signal Detection Theory
Methodology
Task Formulation
Scale Design Dimensions
Granularity ($\mathcal{G}$).
Boundary Shifting ($\mathcal{B}$).
Non-standard Ranges ($\mathcal{N}$).
Evaluation Metrics
Expected Calibration Error (ECE).
AUROC.
Metacognitive Sensitivity ($meta\text{-}d'$).
...and 40 more sections

Figures (5)

Figure 1: The overview of this paper.
Figure 2: Baseline discretization profile under the standard $[0,100]$ confidence scale. Across all six models, confidence reports are highly concentrated on a small set of round-number values. The most frequent value alone accounts for 35.6%--68.4% of responses, while the top three values cover 78.2%--92.1%. At the same time, models use only 15--28 distinct integers out of 101 possible values, revealing severe discretization of the verbalized confidence signal.
Figure 3: Sensitivity of Expected Calibration Error (ECE) to the number of bins $B$ under discretized confidence distributions. When confidence values concentrate on a small set of anchors, small changes in $B$ can shift observations across bin boundaries, producing unstable ECE estimates.
Figure 4: Distribution of raw confidence values across representative scale conditions for GPT-5.2, aggregated across datasets. Under the standard $[0,100]$ scale, confidence reports concentrate on a small set of round-number anchors, whereas alternative scale designs produce qualitatively different distributional patterns. Dashed vertical lines indicate the most frequent round-number anchors within each range.
Figure 5: Distribution of raw confidence values across representative scale conditions for LLaMA-4-Maverick, aggregated across datasets. Similar to GPT-5.2, the standard $[0,100]$ scale produces strong clustering on round-number anchors, while coarser, shifted, and non-standard scales alter the distributional pattern. Dashed vertical lines indicate the most frequent round-number anchors within each range.
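The bin-sensitivity effect described in the Figure 3 caption can be reproduced with a small sketch, assuming the common equal-width-bin form of ECE, $\sum_b \frac{n_b}{N}\,|\mathrm{acc}_b - \mathrm{conf}_b|$. The helper name, the integer bin-index rule, and the toy data are assumptions made for illustration, not the paper's implementation.

```python
def expected_calibration_error(confidences, correct, num_bins, scale_max=100):
    """Equal-width-bin ECE for integer confidences in [0, scale_max].

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Integer arithmetic avoids floating-point issues at bin boundaries.
        idx = min(conf * num_bins // scale_max, num_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / (len(members) * scale_max)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Made-up data: two anchors, 80 (all correct) and 90 (7 of 10 correct).
confs = [80] * 10 + [90] * 10
hits = [1] * 10 + [1] * 7 + [0] * 3

ece_5 = expected_calibration_error(confs, hits, num_bins=5)
ece_10 = expected_calibration_error(confs, hits, num_bins=10)
print(ece_5, ece_10)
```

With $B=5$ both anchors fall into one bin, where under- and overconfidence cancel and the estimate is zero; with $B=10$ they separate and the estimate rises to about 0.2. The data are identical in both cases, so only the choice of $B$ changes the result.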
