Chain-of-Thought Unfaithfulness as Disguised Accuracy

Oliver Bentham; Nathan Stringham; Ana Marasović

TL;DR

Within a single family of proprietary models, Lanham et al. (2023) find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13-billion-parameter model is more faithful than the other models tested, which range from 810 million to 175 billion parameters in size.

Abstract

Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13-billion-parameter model is more faithful than the other models tested, which range from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup of their scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller, less capable models. This normalized faithfulness metric is also strongly correlated ($R^2 = 0.74$) with accuracy, raising doubts about its validity for evaluating faithfulness.
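The two quantities the abstract relies on can be illustrated with a minimal sketch. It assumes, as one plausible reading of "dependence on its CoT", that the raw metric is the fraction of examples whose answer is unchanged when the CoT is removed; the paper's exact definition and its normalization are not given in this excerpt and are not reproduced here. The function names `agreement_rate` and `r_squared` and the toy answer lists are hypothetical and chosen for illustration only.

```python
from statistics import mean


def agreement_rate(answers_with_cot, answers_without_cot):
    """Fraction of examples whose answer is unchanged when the CoT is removed.

    Assumption: a high rate means the answer barely depends on the CoT,
    which is read here as a sign of unfaithfulness.
    """
    if len(answers_with_cot) != len(answers_without_cot):
        raise ValueError("answer lists must have the same length")
    if not answers_with_cot:
        raise ValueError("answer lists must be non-empty")
    same = sum(a == b for a, b in zip(answers_with_cot, answers_without_cot))
    return same / len(answers_with_cot)


def r_squared(xs, ys):
    """R^2 of a simple least-squares line fit of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy ** 2) / (sxx * syy)


# Toy multiple-choice answers (illustrative, not data from the paper).
with_cot = ["A", "B", "C", "D"]
without_cot = ["A", "B", "A", "A"]
print(agreement_rate(with_cot, without_cot))  # 0.5
```

In this sketch, `r_squared` would be applied to per-model pairs of (accuracy, metric value), one pair per model and benchmark, which is the kind of comparison behind the reported correlation.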

Paper Structure (21 sections, 3 equations, 10 figures, 11 tables)

Introduction
Related Work
Faithfulness Tests
Scaling Laws
Experimental Setup
Lanham et al. (2023) Unfaithfulness
Normalized Lanham et al. (2023) Unfaithfulness
Models
Multiple Choice Benchmarks
Addition Task
Results
Do we observe inverse scaling in faithfulness once models become sufficiently capable across different model families?
Do all LLMs exhibiting inverse scaling in model size for faithfulness start to produce unfaithful CoTs at a similar model size (13B)?
Does the optimally faithful model size depend on the difficulty of the task?
Discussion and Conclusions
...and 6 more sections

Figures (10)

Figure 1: Model size vs. unfaithfulness results reported in Lanham et al. (2023).
Figure 2: Lanham et al. (2023)'s unfaithfulness, our normalized unfaithfulness, and CoT prompting accuracy across different model sizes for Pythia DPO and Llama 2 model families; see the Experimental Setup section for details. Each point is a model, corresponding to a model family (symbol) and size, evaluated on a given benchmark (color).
Figure 3: Lanham et al. (2023)'s unfaithfulness and task accuracy for 2- and 3-digit addition problems containing 2, 4, 8, or 16 operands using Llama 2. The bottom plots show the accuracy for each condition, where CoT prompting accuracy is represented by a dashed line and without-CoT accuracy by a solid line. The x-axis for all plots shows model size (number of model parameters) on a log scale. For both tasks, the optimally faithful model according to the metric occurs at 13B; however, this might be due to the sparse nature of the x-axis.
Figure 4: Task accuracy vs. Lanham et al. (2023)'s unfaithfulness metric on the left and task accuracy vs. normalized unfaithfulness on the right. The dashed black line is a linear regression fit to the data, shown with its $R^2$ value. The correlation between accuracy and unfaithfulness is minimal ($R^2 = 0.025$), but strong ($R^2 = 0.740$) between accuracy and normalized unfaithfulness.
Figure 5: Illustration of the same vs. different ordering conditions for MCQs.
...and 5 more figures
