Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Katie Matton, Robert Osazuwa Ness, John Guttag, Emre Kıcıman

TL;DR

The paper tackles the problem that explanations generated by large language models (LLMs), while plausible, may be unfaithful to the model's actual reasoning. It defines causal concept faithfulness as the alignment between the true causal effect of each concept and the effect that the explanations imply. The causal concept effect (CE) is based on KL divergence, the explanation-implied effect (EE) is based on explanation content, and faithfulness is measured as the Pearson correlation coefficient (PCC) between the two, extended to the dataset level via Bayesian hierarchical modeling. The two-stage estimation pipeline uses an auxiliary LLM to identify concepts and generate counterfactuals, then fits a Bayesian multinomial regression to estimate concept effects, alongside an analysis of the explanations to estimate explanation-implied effects. Empirical studies on BBQ social-bias questions and MedQA medical questions reveal systematic patterns of unfaithfulness across models (e.g., hiding the influence of social bias or of mental-status information) and show that the method can diagnose not just overall faithfulness but also semantic patterns of misalignment. The findings have practical implications for model evaluation, bias mitigation, and safer deployment of LLMs, and the authors provide code to enable replication and further research.
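To make the quantities in the summary concrete, here is a minimal Python sketch of how a KL-based concept effect and a correlation-based faithfulness score could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the direction of the KL divergence, the plain averaging over counterfactuals, and all numbers are hypothetical, and the Bayesian hierarchical estimation the paper uses is replaced here by simple point estimates.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer choices."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def concept_effect(original, counterfactuals):
    """Assumed CE: mean KL divergence between the answer distribution on the
    original question and on each counterfactual that alters one concept."""
    return sum(kl_divergence(original, cf) for cf in counterfactuals) / len(counterfactuals)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up answer distributions over three choices (illustrative only).
original = [0.05, 0.90, 0.05]
counterfactuals = {
    "identity": [[0.60, 0.35, 0.05]],  # answers shift a lot when identity changes
    "behavior": [[0.10, 0.85, 0.05]],  # answers shift slightly
    "context":  [[0.05, 0.90, 0.05]],  # answers do not shift
}
# Made-up EE values: how strongly explanations imply each concept mattered.
ee = {"identity": 0.0, "behavior": 0.9, "context": 0.1}

concepts = list(counterfactuals)
ce = {c: concept_effect(original, counterfactuals[c]) for c in concepts}
faithfulness = pearson([ce[c] for c in concepts], [ee[c] for c in concepts])
```

In this toy setup the identity concept has a large causal effect that the explanations never mention, which is the kind of unfaithful omission the summary describes; a low or negative correlation between CE and EE flags it.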

Abstract

Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.

Paper Structure

This paper contains 39 sections, 14 equations, 18 figures, 40 tables.

Figures (18)

  • Figure 1: Dataset-level faithfulness results on BBQ. We plot the CE vs the EE for each concept, as well as faithfulness $\mathcal{F}(\mathbf{X})$ (blue line). Shaded region = 90% credible interval. GPT-3.5 produces explanations with the highest faithfulness. All models exhibit high faithfulness for Context concepts, which have low CE and low EE, but appear less faithful for Identity and Behavior.
  • Figure 2: Question-level faithfulness results for BBQ example question. For each LLM, we report the causal concept effect (CE) and the explanation-implied effect (EE) of each concept, along with the faithfulness $\mathcal{F}(\mathbf{x})$. $[\cdot,\cdot]$ = 90% credible interval. All LLMs exhibit some degree of unfaithfulness. GPT-4o receives the lowest faithfulness score. Both GPT-4o and GPT-3.5 produce explanations with unfaithful omissions of the identity concept, as seen by the concept's high CE and low EE scores.
  • Figure 2: Identity concept interventions on BBQ example question. Middle: In response to the original question, all models almost always select (B) Undetermined. Left: When the wealth status of the individuals is removed, both GPT models frequently select the man asking for help, whereas Claude continues to select undetermined. Right: When the individuals' wealth statuses are swapped, GPT-3.5 selects the person asking for help (now described as rich) with higher probability.
  • Figure 3: Dataset-level faithfulness results on MedQA. We plot the CE vs the EE for each concept, as well as faithfulness $\mathcal{F}(\mathbf{X})$ (blue line). Shaded region = 90% credible interval. Explanations from GPT-3.5 are moderately faithful, whereas those from the other LLMs are less faithful.
  • Figure 4: Left: Causal graph of the data generating process for question $\mathbf{x}$ and model $\mathcal{M}$. $U$ is an unobserved (exogenous) variable that represents the state of the world, which gives rise to different questions $X$. $\{C_m\}_{m=1}^{M}$ are mediating variables that represent the concepts in the question context. $V$ is another mediating variable that represents all aspects of $X$ not accounted for by the concepts (e.g., style). $Y$ is $\mathcal{M}$'s answer. $\mathcal{E}$ is an unobserved variable that accounts for model stochasticity. Dashed lines indicate possible causal relationships between the mediating variables. Right: Causal graph of an intervention that (1) changes the value of a concept $C_m$ to a new value and (2) keeps the values of all other concepts and of $V$ fixed.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3