How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel

Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.

Paper Structure

This paper contains 35 sections, 3 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Direct comparison (AUROC vs. cost across datasets) between extended thinking (gpt-oss-20b-high) and shallow thinking (gpt-oss-20b-low).
  • Figure 2: AUROC as a function of the SC weight $\lambda$ in the hybrid SC+VC signal, shown for $K{=}2$, $5$, and $8$. Results are averaged across models and tasks within each domain, with shaded regions indicating 95% confidence intervals. Performance is stable across a wide range of $\lambda$, with degradation only at the extremes corresponding to pure VC or pure SC.
  • Figure 3: Kendall’s $\tau$ rank correlation between VC and SC as a function of the number of samples $K$, macro-averaged across reasoning models and task families. Correlation starts low and increases with sampling depth, mirroring the front-loaded gains from simply adding the two signals. It is consistently lower in mathematics than in STEM and humanities, coinciding with RLVR training on math.
  • Figure 4: Overview of the uncertainty instruction prompts that define the VC methods: (a) vanilla uncertainty instruction, (b) verification uncertainty instruction, and (c) epistemic uncertainty instruction. Each instruction is used for both the elicitation and the judge methods. For the judge method, the epistemic uncertainty instruction differs slightly, because the judge must attend to the solver's reasoning trace rather than its own.
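
The hybrid SC+VC signal referred to in the abstract and in the Figure 2 caption can be illustrated with a minimal sketch. The exact estimator is not given in this excerpt, so the following are assumptions made for illustration only: SC is the vote share of the majority answer among the $K$ samples, VC is the mean stated confidence over the samples that gave that answer, and the hybrid is the convex combination $\lambda \cdot \mathrm{SC} + (1-\lambda) \cdot \mathrm{VC}$, so that $\lambda = 0$ is pure VC and $\lambda = 1$ is pure SC. All function names are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and its vote share among the K samples."""
    majority, votes = Counter(answers).most_common(1)[0]
    return majority, votes / len(answers)


def verbalized_confidence(answers, confidences, majority):
    """Mean stated confidence over samples that gave the majority answer.

    Assumption: the paper's aggregation rule may differ.
    """
    vals = [c for a, c in zip(answers, confidences) if a == majority]
    return sum(vals) / len(vals)


def hybrid(sc, vc, lam=0.5):
    """Assumed hybrid signal: lam * SC + (1 - lam) * VC."""
    return lam * sc + (1 - lam) * vc


def auroc(scores, labels):
    """Chance that a correct item outscores an incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# One question, K = 3 sampled answers with stated confidences in [0, 1].
answers = ["42", "42", "17"]
confidences = [0.9, 0.8, 0.6]
majority, sc = self_consistency(answers)
vc = verbalized_confidence(answers, confidences, majority)
score = hybrid(sc, vc, lam=0.5)
```

Computing `score` for every question and passing the scores, together with 0/1 correctness labels, to `auroc` gives one point on a curve like the one in Figure 2; sweeping `lam` from 0 to 1 would trace the rest under these assumptions.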