Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

TL;DR

The paper identifies two critical failures in LLMs' multi-step reasoning: hypothetical consistency (predicting the model's own outputs in related contexts) and compositional consistency (maintaining coherence when intermediate steps are replaced by their outputs). It formalizes these concepts, develops evaluation methodologies, and demonstrates that even large models like GPT-4 and GPT-3 variants show limited consistency across hypothetical and compositional transformations. Empirical results on datasets including Wikipedia, DailyDialog, arithmetic tasks, and GeoQuery reveal that compositionally consistent performance remains substantially below perfect, with sources of inconsistency traced to final-answer alignment and subtree parsing. The findings underscore the need for new training objectives and prompting techniques to improve logical reliability in complex, multi-step reasoning tasks.

Abstract

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
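As a rough illustration of the compositional consistency check described above, the following sketch replaces a sub-expression with the model's own answer for it and compares the two final outputs. It is not the paper's evaluation code: `toy_model`, `TOY_ANSWERS`, `CASES`, and the function names are hypothetical stand-ins, and the lookup table takes the place of a real LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM: a fixed lookup from prompt to answer.
# A real evaluation would query the model instead.
TOY_ANSWERS: Dict[str, str] = {
    "(1 + 1) * 3": "6", "(1 + 1)": "2", "2 * 3": "6",    # correct and consistent
    "(2 + 2) * 2": "10", "(2 + 2)": "5", "5 * 2": "10",  # wrong but consistent
    "(2 + 3) * 4": "24", "(2 + 3)": "5", "5 * 4": "20",  # inconsistent
}


def toy_model(prompt: str) -> str:
    """Return the canned answer for a prompt (assumed to be in the table)."""
    return TOY_ANSWERS[prompt]


def is_compositionally_consistent(
    model: Callable[[str], str], prompt: str, sub_step: str
) -> bool:
    """Check whether the final answer survives substituting the sub-step
    with the model's own answer for that sub-step."""
    sub_answer = model(sub_step)
    substituted = prompt.replace(sub_step, sub_answer, 1)
    return model(prompt) == model(substituted)


def consistency_rate(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> float:
    """Fraction of (prompt, sub_step) pairs that are consistent."""
    hits = sum(is_compositionally_consistent(model, p, s) for p, s in cases)
    return hits / len(cases)


CASES: List[Tuple[str, str]] = [
    ("(1 + 1) * 3", "(1 + 1)"),
    ("(2 + 2) * 2", "(2 + 2)"),
    ("(2 + 3) * 4", "(2 + 3)"),
]
```

The second case is arithmetically wrong yet counts as consistent, which is the distinction the abstract draws between correctness and consistency: the check only compares the model's outputs with each other, never with a gold answer.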

Paper Structure (17 sections, 5 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: An overview of the two types of self-consistency failures we identify in LLMs.
  • Figure 2: Hypothetical consistency rates on multiple-choice self-knowledge prompts for the Wikipedia and DailyDialog datasets, across the four GPT-3 model sizes. Each line is the average taken across all $k$-shot prompts, for $k \in \{1, \ldots, 10\}$. The shaded region represents the 95% confidence interval computed with nonparametric bootstrapping. The label "number of words from original completion to distinguish" corresponds to the quantity $m$ in Table \ref{tab:hypothetical-prompt-templates}.
  • Figure 3: A more detailed breakdown of the numbers in Figure \ref{fig:wiki-dd-sk-acc}: the percentage of the time that each model selects each possible answer choice when prompted with a hypothetical consistency prompt, averaged across all prompts (i.e., across all $m$, the number of words that the model is asked to predict, and all $k$, the number of few-shot examples). The columns labeled "Wikipedia" and "DailyDialog" correspond to the answer choice containing the completion from the original dataset. Model outputs that could not be parsed into an answer choice are not included.
  • Figure 4: Compositional consistency rates versus the number of in-context examples on the arithmetic and GeoQuery tasks. The shaded region represents the 95% confidence interval computed with nonparametric bootstrapping.
  • Figure 5: The correctness versus compositional consistency rate of each type of GPT-3 or GPT-4 model on the arithmetic task.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Definition 2.1: Self-consistency
  • Definition 2.2: Hypothetical Transformation
  • Definition 2.3: Compositional prompt
  • Definition 2.4: Compositional transformation
  • Definition 2.5: Hypothetical consistency
  • Claim 2.6
  • proof
  • Definition 2.7: Consistency over compositional transformations