When is the consistent prediction likely to be a correct prediction?

Alex Nguyen, Dheeraj Mekala, Chengyu Dong, Jingbo Shang

TL;DR

This work revisits self-consistency in LLM predictions and shows that correctness is more strongly associated with consistent answers drawn from longer, computation-heavy reasoning texts than with the most consistent answer across all outputs. It demonstrates that CoT-style reasoning can appear spontaneously within longer outputs even without prompting, and that restricting attention to longer sampled responses recovers about 86% of zero-shot CoT self-consistency performance on GSM8K and MultiArith with Mixtral-8x7B. The authors propose length-aware decoding and a minimum-consistency threshold to exploit this effect, validating the approach on Mixtral-8x7B and Llama-2-70B across multiple datasets. Because longer outputs are rare under ordinary sampling, realizing these gains in zero-shot reasoning requires deliberate, length-conditioned decoding strategies.

Abstract

Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation, i.e., longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because, as we demonstrate, LLMs can autonomously produce chain-of-thought (CoT) style reasoning with no custom prompts merely while generating longer responses, which leads to consistent predictions that are more accurate. In the zero-shot setting, by sampling the Mixtral-8x7B model multiple times and considering longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.
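
The idea of voting only over longer responses can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace-based token count, the `min_tokens` cutoff, the `min_consistency` parameter, and the toy samples are all assumptions made for illustration. A real setup would count tokens with the model's own tokenizer and extract answers from actual model outputs.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_answer(answers: Sequence[str]) -> Optional[Tuple[str, float]]:
    """Return (most frequent answer, its share of the votes), or None if empty."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def length_filtered_vote(
    samples: Sequence[Tuple[str, str]],
    min_tokens: int,
    min_consistency: float = 0.0,
) -> Optional[str]:
    """Majority vote restricted to responses with at least `min_tokens` tokens.

    `samples` holds (response_text, extracted_answer) pairs. Token length is
    approximated by whitespace splitting (an assumption, not a real tokenizer).
    Returns None when no response is long enough or when the winning answer's
    vote share falls below `min_consistency` (a hypothetical threshold).
    """
    long_answers = [ans for text, ans in samples if len(text.split()) >= min_tokens]
    result = majority_answer(long_answers)
    if result is None or result[1] < min_consistency:
        return None
    return result[0]


# Toy, hand-written samples (illustrative only, not model outputs).
SAMPLES = [
    ("26", "26"),
    ("The answer is 26.", "26"),
    ("She makes 26 dollars.", "26"),
    ("She has 16 eggs, uses 3 + 4 = 7, leaving 9. At 2 dollars each that is 18.", "18"),
    ("16 - 3 - 4 = 9 eggs remain, and 9 * 2 = 18 dollars.", "18"),
    ("16 - 3 = 13 eggs remain after breakfast, and 13 + 7 = 20 dollars.", "20"),
]
```

On these toy samples, a plain vote over all six answers picks "26", while `length_filtered_vote(SAMPLES, min_tokens=10)` picks "18" because only the three longer, reasoning-style responses are counted. Raising `min_consistency` makes the function abstain when the longer responses do not agree strongly enough.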

Paper Structure

This paper contains 18 sections, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: The average frequency of the most consistent answer per bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
  • Figure 2: The average accuracy of the most consistent answer per bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
  • Figure 3: The average percentage of CoT-style reasoning texts in each bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the GSM8K dataset.
  • Figure 4: The average frequency of the most consistent answer per token length bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the MultiArith dataset.
  • Figure 5: The average accuracy of the most consistent answer per token length bucket obtained via both Mixtral-8x7B and Llama-2 70B models on the MultiArith dataset.
  • ...and 7 more figures