Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Katie Matton, Robert Osazuwa Ness, John Guttag, Emre Kıcıman
TL;DR
The paper addresses the problem that explanations from large language models can be plausible yet unfaithful to the model's actual reasoning. It defines causal concept faithfulness as the alignment between the true causal effect of each concept on the model's answers and the effect its explanations imply. The causal concept effect (CE) is based on the KL divergence between the model's answer distributions before and after a concept is changed, and the explanation-implied effect (EE) is based on the content of the explanations. Faithfulness is measured as the Pearson correlation coefficient (PCC) between CE and EE, and is extended from single examples to whole datasets via Bayesian hierarchical modeling. Estimation proceeds in two stages: an auxiliary LLM identifies concepts and generates counterfactual inputs, then a Bayesian multinomial regression estimates concept effects while an analysis of the explanations estimates explanation-implied effects. Experiments on BBQ social-bias questions and MedQA medical questions reveal systematic patterns of unfaithfulness across models (e.g., explanations that hide the influence of social bias or of mental-status information) and show that the method can diagnose not only overall faithfulness but also semantic patterns of misalignment. The findings bear on model evaluation, bias mitigation, and safer deployment of LLMs, and the authors provide code to enable replication and further research.
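The faithfulness score summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses plain empirical averages in place of the Bayesian hierarchical model, and the function names, the direction of the KL divergence, the treatment of EE as a simple mention rate, and all toy numbers are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer choices.
    The direction (original vs. counterfactual) is an assumption of this sketch."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def causal_effect(original_dist, counterfactual_dists):
    """Hypothetical CE estimate: mean KL divergence between the model's answer
    distribution on the original input and on each counterfactual input."""
    total = sum(kl_divergence(original_dist, cf) for cf in counterfactual_dists)
    return total / len(counterfactual_dists)

def explanation_implied_effect(mentions):
    """Hypothetical EE estimate: fraction of explanations that cite the concept
    as having influenced the answer (a simplification)."""
    return sum(mentions) / len(mentions)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

# Toy data (made up): three concepts from one question with two answer choices.
original = [0.9, 0.1]
concepts = {
    "concept_a": {"counterfactuals": [[0.3, 0.7], [0.4, 0.6]],
                  "mentions": [False, False, False, True]},
    "concept_b": {"counterfactuals": [[0.88, 0.12]],
                  "mentions": [True, True, True, False]},
    "concept_c": {"counterfactuals": [[0.9, 0.1]],
                  "mentions": [False, False, False, False]},
}

ce = [causal_effect(original, c["counterfactuals"]) for c in concepts.values()]
ee = [explanation_implied_effect(c["mentions"]) for c in concepts.values()]
faithfulness = pearson(ce, ee)
print(f"CE={ce}, EE={ee}, faithfulness (PCC)={faithfulness:.3f}")
```

In this toy setup, a concept with a large causal effect that the explanations rarely mention pulls the correlation down, which is the kind of misalignment the faithfulness score is meant to surface.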
Abstract
Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.
