Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho
TL;DR
The paper identifies two failures of self-consistency in LLMs' multi-step reasoning: failures of hypothetical consistency (a model's ability to predict its own output in a hypothetical other context) and of compositional consistency (whether the final output stays the same when intermediate steps are replaced by the model's own outputs for those steps). It formalizes both concepts, develops evaluation methodologies, and demonstrates that even large models such as GPT-4 and GPT-3 variants show limited consistency under hypothetical and compositional transformations. Empirical results on Wikipedia, DailyDialog, arithmetic tasks, and GeoQuery show that compositional consistency remains substantially below perfect, with sources of inconsistency traced to final-answer alignment and subtree parsing. The findings underscore the need for new training objectives and prompting techniques to improve logical reliability in complex, multi-step reasoning tasks.
Abstract
Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
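To make the two definitions concrete, the following is a minimal sketch of how each check might be expressed in code. It is an illustration under stated assumptions, not the paper's evaluation protocol: the `Model` interface, the `HYPOTHETICAL_TEMPLATE` wording, the exact-match comparison, and the canned `toy_model` are all hypothetical stand-ins introduced here. A real evaluation would query an actual LLM and would likely need a more careful comparison of outputs.

```python
from typing import Callable

# Assumed interface: a model maps a prompt string to a completion string.
Model = Callable[[str], str]

# Hypothetical wording for asking a model about its own output.
HYPOTHETICAL_TEMPLATE = 'What would your output be for the prompt "{prompt}"?'


def hypothetically_consistent(model: Model, prompt: str) -> bool:
    """True if the model's prediction of its own output matches its actual output."""
    actual = model(prompt)
    predicted = model(HYPOTHETICAL_TEMPLATE.format(prompt=prompt))
    return predicted.strip() == actual.strip()


def compositionally_consistent(model: Model, prompt: str, sub_step: str) -> bool:
    """True if the final answer is unchanged after the sub-step is replaced
    by the model's own answer to that sub-step."""
    sub_answer = model(sub_step)
    substituted = prompt.replace(sub_step, sub_answer, 1)
    return model(prompt).strip() == model(substituted).strip()


# Toy stand-in for an LLM with canned answers (illustrative only).
CANNED = {
    "(2 + 3) * 4": "20",
    "(2 + 3)": "5",
    "5 * 4": "21",  # deliberately wrong, to produce a compositional inconsistency
    HYPOTHETICAL_TEMPLATE.format(prompt="(2 + 3) * 4"): "20",
}


def toy_model(prompt: str) -> str:
    return CANNED.get(prompt, "unknown")


if __name__ == "__main__":
    print(hypothetically_consistent(toy_model, "(2 + 3) * 4"))               # True
    print(compositionally_consistent(toy_model, "(2 + 3) * 4", "(2 + 3)"))  # False
```

Note that the toy model answers the original prompt correctly ("20") and still fails the compositional check, which illustrates the abstract's point that correctness and consistency are separate properties.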
