Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Angelica Chen; Jason Phang; Alicia Parrish; Vishakh Padmakumar; Chen Zhao; Samuel R. Bowman; Kyunghyun Cho

TL;DR

The paper identifies two failures of self-consistency in LLMs' multi-step reasoning: hypothetical consistency (a model's ability to predict what its own output would be in a hypothetical other context) and compositional consistency (whether a model's final output stays the same when intermediate sub-steps are replaced with the model's own outputs for those steps). It formalizes both notions, develops evaluation methodologies for them, and shows that GPT-4 and several GPT-3 variants exhibit limited consistency under both hypothetical and compositional transformations. Empirical results on Wikipedia, DailyDialog, an arithmetic task, and GeoQuery show that compositional consistency rates remain substantially below perfect, with sources of inconsistency traced to final-answer alignment and subtree parsing. The findings underscore the need for new training objectives and prompting techniques to improve logical reliability in complex, multi-step reasoning tasks.

Abstract

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning: hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
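The two consistency notions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_model` is a hypothetical stand-in for an LLM, `HYPOTHETICAL_PREFIX` is an invented prompt template, and exact string match replaces the multiple-choice format mentioned in the figure captions. The toy model is deliberately flawed (it drops parentheses, and it predicts its own answers with a different procedure than the one it actually uses), so both failure modes show up.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]

# Hypothetical prompt template for self-prediction (not the paper's wording).
HYPOTHETICAL_PREFIX = "What would you output for: "


def _precedence_eval(expr: str) -> int:
    """Evaluate a '+'/'*' expression with the usual precedence, ignoring parentheses."""
    flat = expr.replace("(", "").replace(")", "")
    total = 0
    for term in flat.split("+"):
        product = 1
        for factor in term.split("*"):
            product *= int(factor)
        total += product
    return total


def _left_to_right_eval(expr: str) -> int:
    """Evaluate space-separated tokens strictly left to right, ignoring parentheses."""
    tokens = expr.replace("(", "").replace(")", "").split()
    value = int(tokens[0])
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        value = value + int(operand) if op == "+" else value * int(operand)
    return value


def toy_model(prompt: str) -> str:
    """Toy stand-in for an LLM: answers directly one way, self-predicts another way."""
    if prompt.startswith(HYPOTHETICAL_PREFIX):
        return str(_left_to_right_eval(prompt[len(HYPOTHETICAL_PREFIX):]))
    return str(_precedence_eval(prompt))


def compositionally_consistent(model: Model, full_prompt: str, sub_prompt: str) -> bool:
    """True if the final answer is unchanged after replacing a sub-step
    with the model's own answer to that sub-step."""
    sub_answer = model(sub_prompt)
    substituted = full_prompt.replace(sub_prompt, sub_answer, 1)
    return model(full_prompt) == model(substituted)


def hypothetically_consistent(model: Model, prompt: str) -> bool:
    """True if the model's prediction of its own output matches its actual output."""
    return model(HYPOTHETICAL_PREFIX + prompt) == model(prompt)


def consistency_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompts on which the model was consistent."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    print(compositionally_consistent(toy_model, "(2 + 3) * 4", "(2 + 3)"))  # False
    print(compositionally_consistent(toy_model, "(2 * 3) + 4", "(2 * 3)"))  # True
    print(hypothetically_consistent(toy_model, "2 + 3 * 4"))                # False
    print(hypothetically_consistent(toy_model, "2 * 3 + 4"))                # True
```

Note that neither check consults a ground-truth answer: a model can be consistent while wrong, or correct on the full prompt while inconsistent, which is why consistency is measured separately from correctness.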

Paper Structure (17 sections, 5 equations, 7 figures, 4 tables)

Introduction
Hypothetical Transformations
Compositional Transformations
Formalizing Consistency
Preliminaries
Composing prompts
Definitions
Evaluating Consistency on Hypothetical Transformations
Experimental Setup
All Model Sizes Perform Poorly At Distinguishing Their Own Completions
Evaluating Compositional Self-Consistency
Experimental Setup
Results
Related Work
Conclusion
...and 2 more sections

Figures (7)

Figure 1: An overview of the two types of self-consistency failures we identify in LLMs.
Figure 2: Hypothetical consistency rates on multiple-choice self-knowledge prompts for the Wikipedia and DailyDialog datasets, across the four GPT-3 model sizes. Each line is the average taken across all $k$-shot prompts, for $k \in \{1, \ldots, 10\}$. The shaded region represents the 95% confidence interval computed with nonparametric bootstrapping. The label "number of words from original completion to distinguish" corresponds to the quantity $m$ in Table \ref{tab:hypothetical-prompt-templates}.
Figure 3: A more detailed breakdown of the numbers in Figure \ref{fig:wiki-dd-sk-acc}: the percentage of the time that each model selects each possible answer choice when prompted with a hypothetical consistency prompt, averaged across all prompts (i.e., across all $m$, the number of words that the model is asked to predict, and all $k$, the number of few-shot examples). The columns labeled "Wikipedia" and "DailyDialog" correspond to the answer choice containing the completion from the original dataset. Model outputs that could not be parsed into an answer choice are not included.
Figure 4: Compositional consistency rates versus the number of in-context examples on the arithmetic and GeoQuery tasks. The shaded region represents the 95% confidence interval computed with nonparametric bootstrapping.
Figure 5: The correctness versus compositional consistency rate of each type of GPT-3 or GPT-4 model on the arithmetic task.
...and 2 more figures
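
The captions above report 95% confidence intervals from nonparametric bootstrapping without stating the exact procedure. As a rough sketch only, a percentile bootstrap over per-prompt consistency outcomes is one common way to obtain such an interval. The function name `bootstrap_ci`, the resample count, and the choice of the percentile method are all assumptions made here, not details from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for a mean consistency rate.

    `outcomes` holds one 0/1 entry per prompt (1 = consistent).
    The percentile method is an assumption; the source does not name a variant.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample prompts with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up outcomes purely for demonstration: 30 consistent, 70 inconsistent.
    demo = [1] * 30 + [0] * 70
    print(bootstrap_ci(demo))
```

Resampling at the prompt level treats prompts as independent draws; if several prompts share the same underlying example, resampling whole examples instead would be the more careful choice.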

Theorems & Definitions (8)

Definition 2.1: Self-consistency
Definition 2.2: Hypothetical Transformation
Definition 2.3: Compositional prompt
Definition 2.4: Compositional transformation
Definition 2.5: Hypothetical consistency
Claim 2.6
proof
Definition 2.7: Consistency over compositional transformations
