Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
TL;DR
This work systematically evaluates the faithfulness of LLM self-generated explanations across 75 models from 13 families using counterfactual interventions. It identifies limitations of correlation-based faithfulness metrics and introduces two more robust measures: phi-CCT, a variant of the Correlational Counterfactual Test (CCT) that does not require token probabilities, and F-AUROC, which is insensitive to imbalanced intervention distributions and accounts for verbosity. The study reveals a clear scaling trend: as models become larger and more capable, their explanations become more faithful on all metrics considered, with F-AUROC showing the strongest association with task performance. These findings support the viability of scaling as a route to safer, more transparent self-explanations, and the work provides practical metrics and tooling for evaluating faithfulness across diverse models and prompts.
Abstract
When asked to explain their decisions, LLMs can often give explanations that sound plausible to humans. But are these explanations faithful, i.e., do they convey the factors actually responsible for the decision? In this work, we analyze counterfactual faithfulness across 75 models from 13 families. We examine the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which these metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) that avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. Our code is available at https://github.com/google-deepmind/corr_faith.
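
To make the idea of a probability-free correlational metric concrete, the following is a minimal sketch of a phi coefficient computed over binary indicators. It assumes, since the abstract does not spell out the definition, that the phi-CCT correlates two binary variables per intervention: whether the model's predicted label changed, and whether the explanation mentioned the intervention. The function name `phi_coefficient` and the toy data are hypothetical and are not taken from the released code.

```python
import math


def phi_coefficient(changed, mentioned):
    """Phi coefficient (Pearson correlation of two binary variables).

    changed[i]   -- assumed: 1 if intervention i changed the predicted label
    mentioned[i] -- assumed: 1 if the explanation mentions intervention i
    Returns 0.0 when either variable is constant (correlation undefined).
    """
    if len(changed) != len(mentioned):
        raise ValueError("inputs must have the same length")

    # Cells of the 2x2 contingency table.
    n11 = sum(1 for c, m in zip(changed, mentioned) if c and m)
    n10 = sum(1 for c, m in zip(changed, mentioned) if c and not m)
    n01 = sum(1 for c, m in zip(changed, mentioned) if not c and m)
    n00 = sum(1 for c, m in zip(changed, mentioned) if not c and not m)

    denom = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denom == 0:
        return 0.0  # convention chosen for this sketch
    return (n11 * n00 - n10 * n01) / denom


# Toy example: four interventions, two changed the label, one was mentioned.
changed = [1, 1, 0, 0]
mentioned = [1, 0, 0, 0]
score = phi_coefficient(changed, mentioned)
```

Under this reading, only the predicted labels are needed, not the token probabilities, which is what would let such a metric be applied to models that expose text output only.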
