Chain-of-Thought Unfaithfulness as Disguised Accuracy

Oliver Bentham, Nathan Stringham, Ana Marasović

TL;DR

Within a single family of proprietary models, Lanham et al. (2023) find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size.

Abstract

Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller, less capable models. This normalized faithfulness metric is also strongly correlated ($R^2 = 0.74$) with accuracy, raising doubts about its validity for evaluating faithfulness.
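To make the idea of "dependence on the CoT" concrete, here is a minimal sketch under one assumed reading of the metric: unfaithfulness is the fraction of questions for which the model gives the same answer with and without CoT. The function name `unfaithfulness` and the answer lists are hypothetical and not taken from the paper; the normalized variant is not shown because the abstract does not define it.

```python
from typing import Sequence


def unfaithfulness(answers_with_cot: Sequence[str],
                   answers_without_cot: Sequence[str]) -> float:
    """Fraction of examples whose answer is unchanged by CoT.

    Assumed reading of the metric: if the answer is the same with and
    without CoT, the model did not need its CoT for that example.
    """
    if len(answers_with_cot) != len(answers_without_cot):
        raise ValueError("both answer lists must cover the same examples")
    if not answers_with_cot:
        raise ValueError("need at least one example")
    # Count the examples where the two prompting conditions agree.
    unchanged = sum(
        with_cot == without_cot
        for with_cot, without_cot in zip(answers_with_cot, answers_without_cot)
    )
    return unchanged / len(answers_with_cot)


# Hypothetical multiple-choice answers for four questions.
score = unfaithfulness(["A", "B", "C", "D"], ["A", "B", "D", "D"])
```

Under this reading, a weak model that always picks the same answer choice would score as highly unfaithful regardless of its reasoning, which is the kind of answer-choice bias the normalization is meant to account for.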

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Model size vs. unfaithfulness results reported in Lanham et al. (2023).
  • Figure 2: Lanham et al. (2023)'s unfaithfulness, our normalized unfaithfulness, and CoT prompting accuracy across different model sizes for the Pythia DPO and Llama 2 model families; see the experimental setup section for details. Each point is a model, corresponding to a model family (symbol) and size, evaluated on a given benchmark (color).
  • Figure 3: Lanham et al. (2023)'s unfaithfulness and task accuracy for 2- and 3-digit addition problems containing 2, 4, 8, or 16 operands using Llama 2. The bottom plots show the accuracy for each condition, where CoT prompting accuracy is represented by a dashed line and without-CoT accuracy by a solid line. The x-axis for all plots is the model size (number of model parameters) on a log scale. For both tasks, the optimally faithful model according to the metric occurs at 13B; however, this might be due to the sparse nature of the x-axis.
  • Figure 4: Task accuracy vs. Lanham et al. (2023)'s unfaithfulness metric on the left and task accuracy vs. normalized unfaithfulness on the right. The dashed black line corresponds to a linear regression line fit to the data, along with its respective $R^2$ value. The correlation between accuracy and unfaithfulness is minimal ($R^2 = 0.025$), but strong ($R^2 = 0.740$) between accuracy and normalized unfaithfulness.
  • Figure 5: Illustration of the same vs. different ordering conditions for MCQs.
  • ...and 5 more figures
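
The $R^2$ values quoted in the Figure 4 caption come from a linear regression of one quantity on another. As a rough illustration of how such a value is computed, the sketch below fits an ordinary least-squares line and returns its coefficient of determination. The function name `r_squared` and the toy data are hypothetical; they are not the paper's measurements.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of the ordinary least-squares line fit of ys on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("xs must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("ys must not all be identical")
    return 1.0 - ss_res / ss_tot


# Toy data only: a perfectly linear relation and a noisy one.
perfect = r_squared([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
noisy = r_squared([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0])
```

A value near 0 (as reported for the original metric) means the fitted line explains almost none of the variation in the plotted quantity, while a value near 1 means the line explains most of it.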