Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

Yuyang Dai

TL;DR

A 0--20 confidence scale consistently improves metacognitive efficiency over the standard 0--100 format, whereas boundary compression degrades performance and round-number preferences persist even under irregular ranges.

Abstract

Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d'. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
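The discretization statistics mentioned in the abstract (share of responses on the most frequent values, number of distinct values used) can be illustrated with a minimal sketch. The function name `discretization_profile` and the toy sample are hypothetical and are not taken from the paper; the authors' actual computation may differ.

```python
from collections import Counter


def discretization_profile(confidences, k=3):
    """Summarize how concentrated a list of confidence reports is.

    Hypothetical helper: returns the share of the single most frequent
    value, the combined share of the k most frequent values, and the
    number of distinct values that appear at all.
    """
    counts = Counter(confidences)
    total = len(confidences)
    top = counts.most_common(k)
    return {
        "top1_share": top[0][1] / total,
        "topk_share": sum(c for _, c in top) / total,
        "distinct_values": len(counts),
    }


# Toy data on a 0-100 scale (illustrative only, not results from the paper).
sample = [90] * 5 + [95] * 3 + [80] * 1 + [85] * 1
profile = discretization_profile(sample)
print(profile)
```

A high `topk_share` together with a small `distinct_values` count is the kind of pattern the abstract describes as heavy discretization.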

Paper Structure (55 sections, 2 equations, 5 figures, 17 tables)

Figures (5)

  • Figure 1: The overview of this paper.
  • Figure 2: Baseline discretization profile under the standard $[0,100]$ confidence scale. Across all six models, confidence reports are highly concentrated on a small set of round-number values. The most frequent value alone accounts for 35.6%--68.4% of responses, while the top three values cover 78.2%--92.1%. At the same time, models use only 15--28 distinct integers out of 101 possible values, revealing severe discretization of the verbalized confidence signal.
  • Figure 3: Sensitivity of Expected Calibration Error (ECE) to the number of bins $B$ under discretized confidence distributions. When confidence values concentrate on a small set of anchors, small changes in $B$ can shift observations across bin boundaries, producing unstable ECE estimates.
  • Figure 4: Distribution of raw confidence values across representative scale conditions for GPT-5.2, aggregated across datasets. Under the standard $[0,100]$ scale, confidence reports concentrate on a small set of round-number anchors, whereas alternative scale designs produce qualitatively different distributional patterns. Dashed vertical lines indicate the most frequent round-number anchors within each range.
  • Figure 5: Distribution of raw confidence values across representative scale conditions for LLaMA-4-Maverick, aggregated across datasets. Similar to GPT-5.2, the standard $[0,100]$ scale produces strong clustering on round-number anchors, while coarser, shifted, and non-standard scales alter the distributional pattern. Dashed vertical lines indicate the most frequent round-number anchors within each range.
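The bin sensitivity described in the Figure 3 caption can be reproduced on toy data. The sketch below assumes a standard equal-width-bin ECE with confidences rescaled to $[0,1]$; the function `ece` and the sample values are hypothetical and are not the paper's implementation or results.

```python
def ece(confidences, correct, n_bins):
    """Expected Calibration Error with n_bins equal-width bins.

    Assumes confidences lie in [0, 1] and correct holds 0/1 outcomes.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += len(b) / n * abs(avg_conf - acc)
    return total


# Toy data: all reports sit on two anchors, 0.80 and 0.95.
# The 0.80 group is underconfident, the 0.95 group overconfident.
conf = [0.80] * 4 + [0.95] * 4
hit = [1, 1, 1, 1] + [1, 1, 1, 0]

ece_5 = ece(conf, hit, n_bins=5)    # both anchors share one bin
ece_10 = ece(conf, hit, n_bins=10)  # anchors fall into separate bins
print(ece_5, ece_10)
```

With five bins the two anchors are merged and their errors cancel, so the estimate is close to zero; with ten bins they are separated and the estimate rises to about 0.2. The underlying predictions are identical in both cases, which is the instability the caption refers to.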