Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu

TL;DR

This work systematically evaluates the faithfulness of LLM self-generated explanations across 75 models from 13 families using counterfactual interventions. It identifies limitations of correlation-based faithfulness metrics and introduces two more robust measures: phi-CCT (a surrogate for the CCT that does not require token probabilities) and F-AUROC (which mitigates sensitivity to intervention imbalance and to explanation verbosity). The study reveals a clear scaling trend: as models become larger and more capable, their explanations become more faithful across multiple metrics, with F-AUROC showing the strongest association with task performance. These findings support the viability of scaling strategies for safer, more transparent self-explanations and provide practical metrics and tooling for evaluating faithfulness across diverse models and prompts.

Abstract

When asked to explain their decisions, LLMs can often give explanations that sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work, we analyze counterfactual faithfulness across 75 models from 13 families. We analyze the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) which avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. Our code is available at https://github.com/google-deepmind/corr_faith.
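
The relationship between the CCT and the phi-CCT can be illustrated with a minimal Python sketch. It assumes, going by the figure captions on this page, that the CCT correlates a continuous prediction impact ($\mathcal{I}_C$, derived from token probabilities) with a binary "mentioned in the explanation" flag, while the phi-CCT swaps in the binary indicator $\mathcal{I}_D$ of whether the top predicted class changed. The toy data, variable names, and the `pearson` helper are hypothetical and are not taken from the corr_faith repository.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one entry per counterfactual intervention.
impact_continuous = [0.80, 0.55, 0.05, 0.02, 0.01, 0.00, 0.03, 0.01]  # needs token probabilities
impact_discrete = [1, 1, 0, 0, 0, 0, 0, 0]  # 1 if the top predicted class changed
mentioned = [1, 0, 1, 0, 0, 0, 0, 0]        # 1 if the explanation mentions the intervention

# CCT-style score: correlate continuous impact with mentions.
cct = pearson(impact_continuous, mentioned)

# phi-CCT-style score: Pearson correlation of two binary variables,
# which is exactly the phi coefficient and needs no token probabilities.
phi_cct = pearson(impact_discrete, mentioned)

print(f"CCT-style score:     {cct:.3f}")
print(f"phi-CCT-style score: {phi_cct:.3f}")
```

Because the second score uses only predicted labels and explanation text, a computation of this kind can be run against models whose APIs do not expose token probabilities.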

Paper Structure

This paper contains 35 sections, 11 theorems, 14 equations, 12 figures, 1 table.

Key Result

Theorem 1

CT is $1$-gameable on all datasets.

Figures (12)

  • Figure 1: phi-CCT predicts CCT. Across our experimental settings, CCT is largely predicted by our simpler phi-CCT (left, $R^2=.92$). The original CT, by contrast, is only very weakly predictive of the CCT (right, $R^2=.09$). Each point represents statistics computed for a given dataset, model, and prompt setting (\ref{subsec:prompts}). Colors show model parameter counts.
  • Figure 2: Random interventions rarely change model predictions. Density histogram of continuous prediction impact ($\mathcal{I}_C$) for each dataset across all models; note the log scale on the y-axis. Color shows the fraction of examples in each bar where the model's top predicted class changed ($\mathcal{I}_D$). $\mathcal{I}_C$ compares token probabilities of class labels; when models generate explanations first (bottom), their predictions are conditioned on these explanations and therefore tend to have higher confidences, leading to fewer intermediate-impact interventions.
  • Figure 3: Correlation is sensitive to class imbalance. Contours show the phi-coefficient between labels and predictions, for a given TPR and FPR (\ref{eq:phi_vs_roc}). P/N shows the ratio of positive to negative examples in the dataset. While TPR and FPR (and derived metrics such as AUROC) are independent of class frequency, correlation gives additional weight to predictions on more common classes. For example, when positive examples are very rare (P/N=0.01), a classifier must achieve very low FPR to attain high correlation, regardless of TPR.
  • Figure 4: (Top) Prompting Qwen 2.5 72B-Instruct to generate concise responses appears to yield more faithfulness than prompting it to generate comprehensive responses, according to both the CCT and phi-CCT. (Bottom) By showing TPR (how frequently impactful interventions are mentioned in explanations) and FPR (how frequently non-impactful interventions are mentioned) over a phi-CCT contour plot, we can see the effect of imbalanced interventions: because impactful interventions ($\mathcal{I}_D=1$) are rare, correlation penalizes models more for false positives (mentioning non-impactful interventions) than false negatives (failing to mention impactful interventions). This effect is most pronounced on ComVE, where only 1.4% of interventions change Qwen's predicted class.
  • Figure 5: Task accuracy vs. parameter count of evaluated instruction-tuned (IT) models. Accuracies increase with parameter count within families, though there can be significant differences across families at a given parameter count. When a model fails to produce a response that matches the expected format, we consider the response incorrect; some of the smallest models cannot format their responses and therefore perform worse than random guessing. See \ref{fig:accuracy_by_family} for accuracy evaluations for different prompting strategies, including pretrained (PT) models.
  • ...and 7 more figures
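
The class-imbalance effect described in the Figure 3 caption can be checked numerically. The sketch below is an illustration derived from the standard confusion-matrix definition of the phi coefficient, rewritten in terms of TPR, FPR, and the P/N ratio; it may differ in form from the paper's own equation, and the function names are hypothetical.

```python
import math


def phi_from_rates(tpr, fpr, pos_neg_ratio):
    """Phi coefficient implied by a (TPR, FPR) operating point and a P/N ratio."""
    prevalence = pos_neg_ratio / (1.0 + pos_neg_ratio)        # P / (P + N)
    predicted_pos = prevalence * tpr + (1.0 - prevalence) * fpr
    denom = math.sqrt(predicted_pos * (1.0 - predicted_pos))
    if denom == 0.0:
        return 0.0  # degenerate case: every prediction falls in one class
    return math.sqrt(prevalence * (1.0 - prevalence)) * (tpr - fpr) / denom


def phi_from_counts(tp, fp, fn, tn):
    """Phi coefficient from raw confusion-matrix counts (standard definition)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0.0 else (tp * tn - fp * fn) / denom


# Same operating point (TPR = 0.9, FPR = 0.1) under different class balances.
for ratio in (1.0, 0.1, 0.01):
    print(f"P/N = {ratio:<4}  phi = {phi_from_rates(0.9, 0.1, ratio):.3f}")
```

With TPR and FPR held fixed, the phi coefficient falls as positives become rarer, which is the sensitivity the caption describes; TPR and FPR themselves do not change with the class ratio.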

Theorems & Definitions (17)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 4
  • proof
  • Lemma 1
  • proof
  • Theorem 4
  • proof
  • ...and 7 more