ConSol: Sequential Probability Ratio Testing to Find Consistent LLM Reasoning Paths Efficiently

Jaeyeon Lee, Guantong Qi, Matthew Brady Neeley, Zhandong Liu, Hyun-Hwan Jeong

TL;DR

This work addresses the high token cost of LLM reasoning by introducing SPRT-based dynamic stopping to identify a dominant reasoning path with fewer samples than traditional self-consistency methods. By modeling LLM outputs as a categorical distribution and reducing the problem to a two-response Bernoulli test, the authors develop SPRT and Mixture SPRT frameworks calibrated for small probability differences, ensuring controlled Type I and II errors. The proposed methods achieve comparable accuracy to self-consistency while delivering substantial token reductions across benchmarks, validating improved token efficiency in practice. The approach is model- and benchmark-agnostic, with publicly available code and datasets to promote reproducibility and further advances in efficient LLM reasoning.
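The two-response reduction mentioned above can be written as a standard Wald sequential test. The block below is a textbook sketch, not the authors' calibrated formulation: the symbols $k_1$, $k_2$, $\theta$, $\theta_1$, $\alpha$, and $\beta$ are introduced here for illustration, and the stopping thresholds are Wald's usual approximations.

```latex
% k_1, k_2: counts of the two most frequent responses seen so far
% \theta = p_1 / (p_1 + p_2): chance that a sample falling on one of
% the two leading responses lands on the leading one
\begin{aligned}
H_0 &: \theta = \tfrac{1}{2}
  \quad \text{(no dominating response)}, \qquad
H_1 : \theta = \theta_1 > \tfrac{1}{2}, \\[4pt]
\Lambda &= k_1 \log(2\theta_1) + k_2 \log\bigl(2(1-\theta_1)\bigr), \\[4pt]
&\text{accept } H_1 \text{ if } \Lambda \ge \log\frac{1-\beta}{\alpha},
\qquad
\text{accept } H_0 \text{ if } \Lambda \le \log\frac{\beta}{1-\alpha},
\qquad
\text{otherwise draw another sample.}
\end{aligned}
```

Here $\alpha$ and $\beta$ are the target Type I and Type II error rates. When $\theta_1$ is only slightly above $\tfrac{1}{2}$, each sample moves $\Lambda$ very little, which is presumably why calibration for small probability differences matters.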

Abstract

Recent advancements in large language models (LLMs) integrating explicit reasoning, such as OpenAI's o3-mini, DeepSeek-R1, and QWQ-32B, enable smaller models to solve complex tasks by generating intermediate reasoning steps prior to providing answers. However, this approach significantly increases computational costs, both monetarily and environmentally. The widely used self-consistency method further exacerbates these costs by aggregating multiple reasoning paths to improve accuracy, often requiring between 40 and 64 samples per task. Although aggregation effectively reduces variance and bias, additional sampling can lead to diminishing returns when early samples yield consistent results. To address these inefficiencies, we propose leveraging Sequential Probability Ratio Testing (SPRT) to dynamically terminate sampling once sufficient consistency is achieved. We calibrate SPRT parameters specifically for LLM applications, accounting for the sensitivity needed to detect the mode of the distribution. Our experiments demonstrate that incorporating SPRT significantly enhances token efficiency, achieving comparable accuracy to self-consistency methods but at a substantially reduced computational cost. To promote transparency and facilitate reproducibility, we have made the source code and datasets used in our experiments publicly available at our GitHub repository: https://github.com/LiuzLab/consol, or available as a PyPI package: pip install consol. We hope that this resource will support further research and encourage the development of new methods building upon our work.
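To make the idea of terminating sampling early concrete, here is a minimal Python sketch of a classic Wald SPRT applied to the two most frequent answers. It is an illustration under stated assumptions, not the authors' implementation and not the API of the `consol` package: the function name `sprt_early_stop`, the callable `sample_answer`, and the defaults `p_alt=0.6`, `alpha=0.05`, `beta=0.05`, and `max_samples=40` are all hypothetical choices. Re-selecting the top two answers after every draw is also a simplification of the reduction described in the paper.

```python
import math
from collections import Counter
from typing import Callable, Tuple


def sprt_early_stop(
    sample_answer: Callable[[], str],
    p_null: float = 0.5,   # H0: the two leading answers are equally likely
    p_alt: float = 0.6,    # H1: the leader wins 60% of head-to-head draws (assumed)
    alpha: float = 0.05,   # target Type I error rate (assumed)
    beta: float = 0.05,    # target Type II error rate (assumed)
    max_samples: int = 40, # sampling budget (assumed)
) -> Tuple[str, int, str]:
    """Draw answers one at a time; stop once Wald's SPRT reaches a decision.

    Returns (most frequent answer, samples used, verdict).
    """
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    gain = math.log(p_alt / p_null)              # added per vote for the leader
    loss = math.log((1 - p_alt) / (1 - p_null))  # added per vote for the runner-up

    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        top = counts.most_common(2)
        k1 = top[0][1]
        k2 = top[1][1] if len(top) > 1 else 0
        llr = k1 * gain + k2 * loss  # log-likelihood ratio of H1 against H0
        if llr >= upper:
            return top[0][0], n, "dominant"
        if llr <= lower:
            return top[0][0], n, "no_dominant"
    return counts.most_common(1)[0][0], max_samples, "budget_exhausted"


if __name__ == "__main__":
    # Stand-in for an LLM call: a sampler that always returns the same answer.
    print(sprt_early_stop(lambda: "42"))
```

With a sampler that always agrees with itself, the test stops well before the budget is spent; with answers split evenly between two candidates, it runs until the budget is exhausted, at which point the most frequent answer so far is returned.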

Paper Structure

This paper contains 31 sections, 24 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Distribution of Probabilities and Entropy for OpenAI's o3-mini Reasoning Models' Responses. We observe that a majority of the $p_1$ values are above $0.5$, indicating that a single response often dominates the others. This tendency becomes more pronounced when stronger models are used relative to the difficulty of the benchmark (Left). The ratio between $p_1$ and $p_2$ is also skewed to the right. As the models become stronger, dominant responses become even more prevalent (Middle). Finally, entropy decreases with model strength, indicating reduced randomness in responses by stronger models (Right). Note: the distributions are based on 40 samples.
  • Figure 2: Scatterplot for the two most frequent responses by OpenAI o3-mini reasoning models. The color represents the correctness of the most frequent answers. When $p_1$ is less than 0.5, it is unlikely that either of the two most frequent responses is the correct answer. This observation justifies our approach of allowing early stopping by accepting $H_0$, concluding there is no dominating response, even with weak evidence.
  • Figure 3: Comparison of consistency and accuracy for different methods on the AIME24 benchmark with the o3-mini-low model as a function of the average number of runs. For Self-Consistency, the number of samples varies between $1$ and $40$. For Adaptive-Consistency, the confidence threshold parameter ranges from $0.74$ to $0.9999$. For Mixture of SPRT, the parameter $\beta$ ranges from $0.94979$ to $0.94997$.
  • Figure S1: Boxplot of the generated token distribution for different models with reasoning (DeepSeek-R1-1.5B and 8B) and without reasoning (llama-3.3-70b-instruct, llama-3.1-405b-instruct, and qwen2.5-32b-instruct) in solving the AIME24 problem. Each dot represents an answer to a question for the corresponding model.
  • Figure S2: Scatterplot of the two most frequent responses for MedQA and GSM8K by OpenAI GPT-4o-mini with CoT.
  • ...and 1 more figure
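
The quantities plotted in Figure 1 ($p_1$, $p_2$, and entropy) can be estimated from a batch of sampled answers as sketched below. This is an illustrative Python snippet, not code from the paper: the function name `response_stats` is hypothetical, and the use of base-2 logarithms for entropy is an assumption, since the caption does not state the base.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def response_stats(answers: Sequence[str]) -> Tuple[float, float, float]:
    """Return (p1, p2, entropy) for a non-empty batch of sampled answers.

    p1 and p2 are the empirical frequencies of the most and second most
    frequent answers; entropy is in bits (base 2 is an assumption).
    """
    n = len(answers)
    freqs = sorted((c / n for c in Counter(answers).values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    entropy = -sum(p * math.log2(p) for p in freqs)
    return p1, p2, entropy


if __name__ == "__main__":
    # 40 made-up samples, matching the batch size noted in the Figure 1 caption.
    batch = ["A"] * 30 + ["B"] * 10
    print(response_stats(batch))
```

A batch in which every answer agrees gives $p_1 = 1$ and zero entropy, the regime in which stopping early costs little.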