Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning

Juming Xiong; Kevin Guo; Congning Ni; Chao Yan; Katherine Brown; Avinash Baidya; Xiang Gao; Bradley Marlin; Zhijun Yin

TL;DR

A confidence-aware decision framework analyzes a single completed reasoning trajectory to choose adaptively between single-path and multi-path reasoning. The results indicate that reasoning trajectories carry rich signals for uncertainty estimation, which enables a simple, transferable mechanism for balancing accuracy and efficiency in LLM reasoning.

Abstract

Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
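
The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name answer_with_gating, the three callables, and the default threshold of 0.5 are hypothetical placeholders, and sending the tie case (confidence exactly equal to the threshold) to multi-path reasoning is an assumption, since the text does not say how ties are handled.

```python
from typing import Callable, List, Tuple


def answer_with_gating(
    question: str,
    greedy_solve: Callable[[str], Tuple[str, List[str]]],
    estimate_confidence: Callable[[List[str]], float],
    multi_path_solve: Callable[[str], str],
    tau: float = 0.5,  # hypothetical default; the paper calibrates the threshold
) -> Tuple[str, str]:
    """Return (answer, route), where route is 'greedy' or 'multi-path'.

    greedy_solve:        hypothetical; returns the greedy answer and its
                         reasoning trajectory as a list of sentences.
    estimate_confidence: hypothetical stand-in for the trained decision model;
                         returns the estimated probability that the greedy
                         answer is correct.
    multi_path_solve:    hypothetical; runs a multi-path method (for example
                         a voting scheme) and returns its answer.
    """
    # Always pay for exactly one complete greedy trajectory.
    answer, trajectory = greedy_solve(question)

    # Score the finished trajectory.
    p_correct = estimate_confidence(trajectory)

    # Accept the greedy answer only when confidence exceeds the threshold.
    if p_correct > tau:
        return answer, "greedy"

    # Otherwise (including the tie case, by assumption) spend extra samples.
    return multi_path_solve(question), "multi-path"
```

In this sketch the token savings come from skipping multi_path_solve on every question whose greedy trajectory scores above the threshold; raising tau sends more questions to the multi-path method.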

Paper Structure (27 sections, 6 equations, 11 figures, 4 tables)

Introduction
Related Work
Reasoning in Large Language Models
Uncertainty Estimation in Reasoning
Adaptive and Early-Exit Reasoning
Method
Sentence-Level Per-Choice Prediction from Logits
Decision Framework
Feature Extraction
Numeric trajectory features.
Linguistic features.
Model Architecture
Experiments
Datasets
Models
...and 12 more sections

Figures (11)

Figure 1: Overview of the decision framework. For each question, the language model $L$ generates a complete reasoning trajectory $S_1, S_2, \dots, S_n$. From this trajectory, sentence-wise numeric and linguistic features are extracted to form a temporal feature sequence. An attention-based recurrent decision model then analyzes this sequence and estimates the probability $P$ that the greedy reasoning path leads to a correct final answer. A confidence threshold $\tau$ is used to determine whether additional multi-path reasoning is necessary: $P > \tau \Rightarrow$ likely correct (accept greedy output), while $P < \tau \Rightarrow$ likely wrong (apply multi-path reasoning such as dynamic voting for enhancement).
Figure 2: Confidence Threshold Calibration. Accuracy (blue, % of DV) and token reduction (orange, % vs. DV) versus confidence threshold $\tau$ on MedQA, MathQA, MedMCQA, and MMLU. $\tau=1.0$ is the DV baseline (100% accuracy, 0% reduction).
Figure 3: Distribution of accuracy and token usage across datasets. Boxplots show the performance of SC, CER, DV, and Ours on MedQA, MathQA, MedMCQA, and MMLU using GPT-OSS 20B. The top panel reports accuracy, while the bottom panel reports average token usage. Statistical significance is evaluated using paired bootstrap with 2,000 resamples; $n.s.$ indicates not significant and $^{*}$ indicates $p<0.05$. Higher accuracy is better, whereas lower token usage indicates greater efficiency.
Figure 4: Confidence Threshold Calibration for Llama 3.1 8B. Accuracy (blue, % of DV) and token reduction (orange, % vs. DV) versus confidence threshold $\tau$ on MedQA, MathQA, MedMCQA, and MMLU. $\tau=1.0$ is the DV baseline (100% accuracy, 0% reduction).
Figure 5: Confidence Threshold Calibration for Qwen 2.5 7B. Accuracy (blue, % of DV) and token reduction (orange, % vs. DV) versus confidence threshold $\tau$ on MedQA, MathQA, MedMCQA, and MMLU. $\tau=1.0$ is the DV baseline (100% accuracy, 0% reduction).
...and 6 more figures
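
The paired bootstrap named in the Figure 3 caption can be sketched as follows. This is a generic one-sided variant written for illustration, not the authors' evaluation code: the function name paired_bootstrap_p_value, the 0/1 per-question correctness encoding, and the choice of a one-sided test are all assumptions. Only the count of 2,000 resamples comes from the caption.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,  # resample count stated in the Figure 3 caption
    seed: int = 0,
) -> float:
    """One-sided paired bootstrap on per-question correctness (1 or 0).

    Returns the fraction of resamples in which method A does NOT score
    strictly higher than method B. Small values suggest A is better.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(correct_a)

    # Pairing: keep each question's A and B outcomes together as a difference.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]

    not_better = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1

    return not_better / n_resamples
```

Resampling whole questions, rather than each method's scores separately, keeps the comparison paired, so question difficulty cancels out of the difference.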
