CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik

TL;DR

This work introduces Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models, and provides theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals.

Abstract

Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
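
The stopping rule described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a fixed, known candidate set with a uniform prior, and a simple likelihood in which a sample with confidence c is correct with probability c and otherwise lands uniformly on one of the remaining candidates. The names `update_posterior`, `confidence_guided_sampling`, and `sample_fn`, as well as the default threshold of 0.95, are hypothetical; the budget of 16 mirrors the B=16 setting reported in the figures.

```python
def update_posterior(posterior, answer, confidence, eps=1e-6):
    """One Bayesian update over a fixed candidate set (assumed likelihood).

    Assumption: the sampled answer is correct with probability `confidence`;
    otherwise it is spread uniformly over the other K-1 candidates.
    """
    k = len(posterior)
    # Clip so a confidence of exactly 0 or 1 cannot zero out a candidate forever.
    c = min(max(confidence, eps), 1.0 - eps)
    unnormalised = {
        cand: mass * (c if cand == answer else (1.0 - c) / (k - 1))
        for cand, mass in posterior.items()
    }
    total = sum(unnormalised.values())
    return {cand: w / total for cand, w in unnormalised.items()}


def confidence_guided_sampling(sample_fn, candidates, threshold=0.95, budget=16):
    """Sample until one candidate's posterior mass reaches `threshold`.

    `sample_fn` is a hypothetical callable returning (answer, confidence)
    for one model call. Requires at least two candidates.
    Returns (best_answer, its_posterior_mass, number_of_calls_used).
    """
    posterior = {cand: 1.0 / len(candidates) for cand in candidates}  # uniform prior
    calls = 0
    while calls < budget:
        answer, confidence = sample_fn()
        calls += 1
        posterior = update_posterior(posterior, answer, confidence)
        if max(posterior.values()) >= threshold:
            break  # early stop: one candidate already dominates
    best = max(posterior, key=posterior.get)
    return best, posterior[best], calls


# Toy usage: a stand-in "model" that always answers "B" with confidence 0.9.
best, mass, calls = confidence_guided_sampling(lambda: ("B", 0.9), ["A", "B", "C", "D"])
print(best, round(mass, 4), calls)
```

In this toy run the posterior on "B" is 0.9 after one call and passes 0.95 after the second, so sampling halts well short of the budget. With real models the confidence would come from token probabilities or a reward model, as the abstract describes.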

Paper Structure

This paper contains 27 sections, 2 theorems, 28 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Under Assumptions assump:independence, assump:uniform, assump:ideal, and assump:conf, and provided the confidences are informative, i.e., $\mathbb{P}(C_t = 1/K) = 0$, the Bayesian confidence-based aggregator identifies the correct answer with probability tending to one as the number of samples $m$ grows. In fact, $X_I\to 1$ and $X_k\to 0$ for all $k\neq I$ almost surely.
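
One way to see why the condition $\mathbb{P}(C_t = 1/K) = 0$ matters is to work through a single posterior update. The following is a sketch under an assumed likelihood, not a statement of the paper's exact model: suppose the $t$-th sampled answer $A_t$ is correct with probability $C_t$ and is otherwise uniform over the remaining $K-1$ candidates. The symbol $A_t$ and the superscript $(t)$ on $X_k$ are notation introduced here for illustration.

$$
X_k^{(t)} \;\propto\; X_k^{(t-1)} \cdot
\begin{cases}
C_t, & A_t = k,\\[4pt]
\dfrac{1 - C_t}{K - 1}, & A_t \neq k.
\end{cases}
$$

If $C_t = 1/K$, the second factor becomes $\frac{1 - 1/K}{K - 1} = \frac{1}{K}$, which equals the first, so every candidate is multiplied by the same constant and the normalized posterior does not move. Under this assumed likelihood, a confidence of exactly $1/K$ therefore carries no evidence, which is consistent with the theorem requiring such values to occur with probability zero.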

Figures (4)

  • Figure 1: Examples of CGES vs. SC. Top: early stopping with high confidence; Bottom: recovering a minority-but-confident answer.
  • Figure 2: Graphical model for the sampling process.
  • Figure 3: Accuracy vs. number of LLM calls ($B{=}16$) on AIME24 (a), MATH500 (b), GSM8K (c), GPQA (d), and MMLU_Pro (e). CGES achieves near-maximal accuracy with far fewer calls than self-consistency.
  • Figure 4: Two prompt templates for evaluation.

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2: Consistency under realistic confidence noise
  • proof
  • proof