When is the consistent prediction likely to be a correct prediction?
Alex Nguyen, Dheeraj Mekala, Chengyu Dong, Jingbo Shang
TL;DR
This work revisits self-consistency in LLM predictions and shows that correctness is more strongly associated with consistent answers reached through longer, computation-heavy reasoning texts than with the most consistent answer across all outputs. It demonstrates that CoT-style reasoning can appear spontaneously within longer outputs even without prompting, and that sampling longer responses yields substantial gains, reaching 86% of zero-shot CoT self-consistency performance on GSM8K and MultiArith. The authors propose length-aware decoding and a minimum-consistency threshold to exploit this effect, validating the approach on Mixtral-8x7B and Llama-2-70B across multiple datasets. The study highlights practical implications for zero-shot reasoning, emphasizing that longer outputs are rare and require deliberate decoding strategies to realize these gains.
Abstract
Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation, i.e., longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because, as we demonstrate, LLMs can autonomously produce chain-of-thought (CoT) style reasoning without any custom prompts simply by generating longer responses, which leads to consistent predictions that are more accurate. In the zero-shot setting, by sampling the Mixtral-8x7B model multiple times and considering only the longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.
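
The idea of voting only among longer responses can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `length_filtered_vote`, the use of whitespace-separated word count as the length measure, the two threshold parameters, and the toy samples are all assumptions made for illustration.

```python
from collections import Counter

def length_filtered_vote(samples, min_length=0, min_consistency=0.0):
    """Majority vote restricted to sufficiently long responses.

    samples: list of (response_text, extracted_answer) pairs.
    min_length: hypothetical threshold on whitespace-separated word count.
    min_consistency: hypothetical minimum share of kept responses that must
        agree with the winning answer; otherwise no answer is returned.
    """
    # Keep only the answers whose response text is long enough.
    kept = [answer for text, answer in samples if len(text.split()) >= min_length]
    if not kept:
        return None
    answer, count = Counter(kept).most_common(1)[0]
    # Abstain when the winning answer is not consistent enough.
    if count / len(kept) < min_consistency:
        return None
    return answer

# Toy data (invented): short responses agree on "20", longer ones on "18".
samples = [
    ("20", "20"),
    ("The answer is 20.", "20"),
    ("It is 20.", "20"),
    ("She has 16 eggs, uses 3 and 4, so 9 remain; 9 times 2 is 18.", "18"),
    ("16 minus 3 minus 4 leaves 9 eggs, and 9 eggs at 2 dollars each is 18.", "18"),
]

plain = length_filtered_vote(samples)                      # vote over all outputs
length_aware = length_filtered_vote(samples, min_length=10)  # vote over long outputs only
```

In this toy case, voting over all outputs picks the answer favored by the short responses, while restricting the vote to longer responses picks the answer backed by explicit reasoning. Because long responses are rare, a practical version would likely need many more samples, or a decoding strategy that encourages longer outputs, before the filtered pool is large enough to vote over.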
