Dissociation of Faithful and Unfaithful Reasoning in LLMs

Evelyn Yee, Alice Li, Chenyu Tang, Yeon Ho Jung, Ramamohan Paturi, Leon Bergen

TL;DR

This work identifies factors that shift LLM error recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. It also finds evidence that distinct mechanisms drive faithful and unfaithful error recoveries.

Abstract

Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.

Paper Structure (36 sections, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Our querying and error recovery evaluation pipeline for errored chain of thought. $<$Questions, Target Answer$>$ pairs are sampled from the original dataset. For a single evaluation, the same model is used for each "LLM" part of the pipeline.
  • Figure 2: An example stimulus from the ASDiv Calculation Error set for GPT-4 (lightly edited for clarity), with demonstrations of the potential error recovery behaviors. The error is highlighted in red, demonstration of faithful recovery is highlighted in green, and unfaithful recovery behaviors are highlighted in blue. The model's final answer to the question is boxed.
  • Figure 3: Overall error recovery rates (as a proportion of all responses) from small errors and large errors. Error bars indicate 95% binomial confidence intervals.
  • Figure 4: Difference between large error and small error recovery rates, as a proportion of all responses. Negative values indicate recoveries occurred more often for small errors. Error bars are 95% confidence intervals.
  • Figure 5: Difference in recovery rates (as a proportion of all responses) between context noise and baseline conditions. Negative values indicate recoveries occurred more often in the baseline condition. Error bars are 95% confidence intervals.
  • ...and 11 more figures
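
The captions for Figures 3 to 5 report recovery rates as proportions of all responses, with 95% confidence intervals. As a minimal sketch of how such intervals can be computed, the snippet below uses a Wilson score interval for a single proportion and a normal-approximation (Wald) interval for the difference between two proportions. Both choices are assumptions made for illustration: the captions do not say which interval method the authors used, and the function names and the counts in the usage lines are hypothetical, not data from the paper.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%).

    Assumption: the paper does not state which binomial interval it uses;
    Wilson is chosen here only because it behaves well near 0 and 1.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


def rate_difference_interval(
    s_a: int, n_a: int, s_b: int, n_b: int, z: float = 1.96
) -> tuple[float, float, float]:
    """Wald interval for the difference p_a - p_b of two independent proportions.

    Returns (difference, lower bound, upper bound).
    """
    p_a, p_b = s_a / n_a, s_b / n_b
    diff = p_a - p_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * std_err, diff + z * std_err


# Hypothetical counts, for illustration only (not results from the paper).
low, high = wilson_interval(successes=50, total=100)
diff, diff_low, diff_high = rate_difference_interval(30, 100, 50, 100)
print(f"recovery rate 95% CI: [{low:.3f}, {high:.3f}]")
print(f"rate difference: {diff:.3f} [{diff_low:.3f}, {diff_high:.3f}]")
```

A negative difference whose interval excludes zero would correspond to the case the captions describe as recoveries occurring more often in the second condition.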