When is the consistent prediction likely to be a correct prediction?

Alex Nguyen; Dheeraj Mekala; Chengyu Dong; Jingbo Shang

TL;DR

This work revisits self-consistency in LLM predictions and shows that correctness is more strongly associated with consistent answers reached through longer, computation-heavy reasoning texts than with the most consistent answer across all outputs. It demonstrates that CoT-style reasoning can appear spontaneously within longer outputs even without prompting, and that considering only longer responses yields substantial gains: with Mixtral-8x7B, it recovers 86% of zero-shot CoT self-consistency performance on GSM8K and MultiArith. The authors propose length-aware decoding and a minimum-consistency threshold to exploit this effect, validating the approach on Mixtral-8x7B and Llama-2-70B across multiple datasets. The study highlights practical implications for zero-shot reasoning, emphasizing that longer outputs are rarely generated and that deliberate decoding strategies are needed to realize these gains.

Abstract

Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation, i.e., longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because, as we demonstrate, LLMs can autonomously produce chain-of-thought (CoT) style reasoning with no custom prompts merely by generating longer responses, which leads to consistent predictions that are more accurate. In the zero-shot setting, by sampling the Mixtral-8x7B model multiple times and considering longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.
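
The idea in the abstract can be illustrated with a minimal sketch: take the majority vote only over the longer sampled responses, and accept it only if it is repeated often enough. This is not the authors' implementation. The function names, the `min_tokens` length cutoff, and the count-based `min_consistency` threshold are all assumptions made for illustration; the paper's actual bucketing and threshold definition may differ.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Each sample is (final_answer, response_length_in_tokens).
Sample = Tuple[str, int]


def plain_vote(samples: Sequence[Sample]) -> str:
    """Standard self-consistency: most frequent answer over all samples."""
    answers = [answer for answer, _ in samples]
    return Counter(answers).most_common(1)[0][0]


def length_aware_vote(
    samples: Sequence[Sample],
    min_tokens: int,        # hypothetical cutoff for a "longer" response
    min_consistency: int,   # hypothetical minimum count of agreeing responses
) -> Optional[str]:
    """Majority vote restricted to longer responses.

    Returns None when no long response exists or when the top answer
    appears fewer than `min_consistency` times.
    """
    long_answers = [a for a, n_tokens in samples if n_tokens >= min_tokens]
    if not long_answers:
        return None
    answer, count = Counter(long_answers).most_common(1)[0]
    return answer if count >= min_consistency else None


# Made-up samples: three short responses agree on "12",
# two longer responses agree on "18".
samples = [
    ("12", 8), ("12", 10), ("12", 9),
    ("18", 120), ("18", 140), ("7", 130),
]
print(plain_vote(samples))
print(length_aware_vote(samples, min_tokens=100, min_consistency=2))
```

In this toy input the plain vote follows the short responses, while the length-restricted vote follows the answer that the longer responses agree on. Raising `min_consistency` to 3 makes the function abstain instead of answering.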

Paper Structure (18 sections, 12 figures, 2 tables)

Introduction
Experiment Setup
Consistent Predictions via Longer Reasoning Texts are more likely to be correct
Self-Consistency with a Minimum Consistency Threshold
Self-Consistency Performance vs Minimum consistency Threshold Analysis
Blurting vs Reasoning Analysis
Likelihood Analysis
Related Work
CoT Reasoning
Self-Consistency
Conclusion
Limitations
Ethical Considerations
Appendix
More details on Experimental Settings
...and 3 more sections

Figures (12)

Figure 1: The average frequency of the most consistent answer per bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
Figure 2: The average accuracy of the most consistent answer per bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
Figure 3: The average percentage of CoT-style reasoning texts in each bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
Figure 4: The average frequency of the most consistent answer per token length bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the MultiArith dataset.
Figure 5: The average accuracy of the most consistent answer per token length bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the MultiArith dataset.
...and 7 more figures
