Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov

TL;DR

The paper tackles unreliable uncertainty estimates from multinomial decoding in consistency-based UQ for LLMs by introducing beam search as a diverse, high-probability candidate generator. It develops a beam-weighted estimator, provides a distribution-free condition under which it outperforms multinomial sampling, and extends the approach to several existing UQ methods. Empirically, across six QA datasets and multiple models, beam-guided uncertainty yields state-of-the-art performance and reduced variance, especially for short outputs. The work offers practical guidance for deploying UQ in safety-critical LLM applications and lays groundwork for future white-box and black-box extensions.

Abstract

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.

Paper Structure

This paper contains 53 sections, 1 theorem, 38 equations, 12 figures, 19 tables.

Key Result

Theorem 1

Let $\mathcal{B}_M(\mathbf{x})=\{\mathbf{b}^{(1)},\dots,\mathbf{b}^{(M)}\}$ be the beam set, $m_{\mathcal{B}} = \sum_{i=1}^M p(\mathbf{b}^{(i)}\mid\mathbf{x})$ be its total probability mass, and define $\mu_{\mathcal{B}}$ and $\mu_{\overline{\mathcal{B}}}$ as dissimilarity inside and outside the beam ... Then the beam-weighted estimator $\widehat{U}_D^{b}$ achieves smaller mean-squared error than the M...
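
To make the quantities in Theorem 1 concrete, here is a minimal Python sketch of the beam set mass $m_{\mathcal{B}}$ and of the two kinds of estimator being compared. The theorem statement above is truncated, so the exact form of $\widehat{U}_D^{b}$ is not given here. The sketch assumes that each beam is weighted by its probability renormalized over the beam set, and that dissimilarity is measured against a reference answer such as the greedy output. All function names, and the exact-match dissimilarity, are hypothetical illustrations rather than the authors' implementation.

```python
import math

def beam_mass(log_probs):
    """Total probability mass m_B of the beam set: sum of p(b_i | x)."""
    return sum(math.exp(lp) for lp in log_probs)

def beam_weighted_dissimilarity(reference, beams, log_probs, dissimilarity):
    """Probability-weighted mean dissimilarity between `reference` and each beam.

    Assumption: weights are p(b_i | x) / m_B, i.e. beam probabilities
    renormalized over the beam set.
    """
    m_b = beam_mass(log_probs)
    return sum(
        (math.exp(lp) / m_b) * dissimilarity(reference, beam)
        for beam, lp in zip(beams, log_probs)
    )

def monte_carlo_dissimilarity(reference, samples, dissimilarity):
    """Unweighted mean dissimilarity over multinomial samples."""
    return sum(dissimilarity(reference, s) for s in samples) / len(samples)

def exact_match_dissimilarity(a, b):
    """Toy dissimilarity (an assumption): 0 if the strings match, else 1."""
    return 0.0 if a.strip().lower() == b.strip().lower() else 1.0
```

With beam probabilities 0.6, 0.2, and 0.1 for "Paris", "Lyon", and "Marseille", and "Paris" as the reference, the beam mass is 0.9 and the weighted estimate is (0.2 + 0.1) / 0.9 = 1/3. The beam-weighted value is deterministic for a fixed model and input, whereas the Monte Carlo value depends on which samples happen to be drawn.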

Figures (12)

  • Figure 1: Beam Search vs Multinomial Sampling. Sampling produces multiple identical generations, resulting in a noisy confidence estimate, while beam search covers the top answers from the LLM distribution, resulting in a better confidence score.
  • Figure 2: Mean percentage of redundant samples (i.e., outputs already seen among earlier generations) as a function of greedy output length. Results were obtained from 2,000 questions from the TriviaQA dataset using the Gemma 3 4B base model and 10 candidate generations. Redundancy is especially high for short answers, leading to wasted computation.
  • Figure 3: Percentage of texts meeting the sufficient condition (Theorem 1). Results are based on 2,000 TriviaQA questions, Gemma 3 4B base and $M=10$. The green "All" bar shows the overall percentage across all lengths.
  • Figure 4: PRR ($\uparrow$ is better) as a function of the number of candidates $M$ on TriviaQA with Gemma 3 4B base. Each panel reports one estimator (Dissimilarity, Eccentricity, EigVecDissimilarity). Curves compare multinomial sampling and beam search (with probability weights from equation \ref{eq:restricted-mass}).
  • Figure 5: PRR ($\uparrow$ is better) for Dissimilarity under beam search (with probability weights) vs. multinomial sampling, for different output lengths. Each dataset (TriviaQA, CoQA) with Gemma 3 4B base is partitioned into five approximately equal-size bins by token length of the greedy output.
  • ...and 7 more figures
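
The redundancy measure in the Figure 2 caption (outputs already seen among earlier generations) can be sketched as follows. This is an illustrative Python version only: the function name is hypothetical, and comparing whitespace-stripped strings for exact equality is an assumption, since the caption does not say how two outputs are judged identical.

```python
def redundant_fraction(generations):
    """Fraction of generations that repeat an earlier generation.

    Assumption: two outputs count as the same when their
    whitespace-stripped strings are exactly equal.
    """
    if not generations:
        return 0.0
    seen = set()
    redundant = 0
    for text in generations:
        key = text.strip()
        if key in seen:
            redundant += 1  # already produced by an earlier sample
        else:
            seen.add(key)
    return redundant / len(generations)
```

For the four samples "Paris", "Paris", "Lyon", "Paris", two of the four repeat an earlier output, so the redundant fraction is 0.5. Beam search returns distinct hypotheses by construction, so a beam set scores 0 on this measure.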

Theorems & Definitions (2)

  • Theorem 1: Comparison condition for beam-weighted and Monte Carlo estimators
  • proof