Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria

TL;DR

This work uses activation patching, a causal mediation technique, to measure how faithfully an explanation supports the answer it explains. It proposes a metric, Causal Faithfulness, that takes the consistency of causal attributions between an explanation and the corresponding model output as the indicator of faithfulness.

Abstract

Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be taken at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage activation patching, a causal mediation technique, to measure how faithfully an explanation supports the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experiment with models ranging from 2B to 27B parameters and find that models that underwent alignment tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests because it takes into account the model's internal computations and avoids out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments. We release the code at https://github.com/wj210/Causal-Faithfulness
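
To make the idea of activation patching concrete, the following is a minimal sketch on a toy NumPy model. It is not the authors' implementation: the model (`make_params`, `forward`), the helper `patching_effects`, and the choice to report the effect as the patched probability minus the corrupted probability are all assumptions made for illustration. The sketch runs a clean input and a corrupted input, then re-runs the corrupted input while overwriting one hidden state at a time with its clean counterpart, and records how much the probability of the target output recovers.

```python
import numpy as np


def make_params(vocab=20, d=8, n_layers=3, seed=0):
    """Random weights for a toy causal model (hypothetical, illustration only)."""
    rng = np.random.default_rng(seed)
    embed = rng.normal(size=(vocab, d))
    layers = [
        (rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=(d, d)) / np.sqrt(d))
        for _ in range(n_layers)
    ]
    w_out = rng.normal(size=(d, vocab))
    return embed, layers, w_out


def forward(tokens, params, patch=None):
    """Run the toy model, optionally overwriting one hidden state.

    patch = (layer, position, vector) replaces the hidden state produced by
    `layer` at `position` before the next layer reads it.
    Returns (output probabilities, list of per-layer hidden states).
    """
    embed, layers, w_out = params
    h = embed[tokens]  # shape (T, d)
    cache = []
    for i, (w_self, w_mix) in enumerate(layers):
        # Causal running mean: each position only sees itself and earlier ones.
        ctx = np.cumsum(h, axis=0) / np.arange(1, len(tokens) + 1)[:, None]
        h = np.tanh(h @ w_self + ctx @ w_mix)
        if patch is not None and patch[0] == i:
            h = h.copy()
            h[patch[1]] = patch[2]
        cache.append(h)
    logits = h[-1] @ w_out  # the prediction is read from the last position
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), cache


def patching_effects(clean, corrupted, target, params):
    """Patch each (layer, token) state from the clean run into the corrupted run."""
    p_clean, clean_cache = forward(clean, params)
    p_corr, _ = forward(corrupted, params)
    effects = np.zeros((len(clean_cache), len(clean)))
    for layer in range(effects.shape[0]):
        for pos in range(effects.shape[1]):
            patch = (layer, pos, clean_cache[layer][pos])
            p_patch, _ = forward(corrupted, params, patch=patch)
            # Assumed effect measure: recovery over the corrupted baseline.
            effects[layer, pos] = p_patch[target] - p_corr[target]
    return p_clean[target], p_corr[target], effects


params = make_params()
clean = np.array([3, 7, 1, 12, 5])
corrupted = clean.copy()
corrupted[2] = 9  # replace one input token to create the counterfactual run
target = int(np.argmax(forward(clean, params)[0]))
p_clean, p_corr, effects = patching_effects(clean, corrupted, target, params)
print(effects.round(3))  # rows: layers, columns: token positions
```

In this toy setup, positions before the replaced token show no effect, because their hidden states are identical in both runs, and patching the last position at the final layer restores the clean probability exactly. A real study would hook the activations of a transformer rather than use this hand-rolled model.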

Paper Structure

This paper contains 21 sections, 3 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Activation Patching (AP): Given two runs, a clean run under normal conditions [left] and a corrupted run [right] in which input tokens are replaced to produce a counterfactual scenario, AP identifies the causal effect of the hidden state at a specified token position and layer through the change in output after inserting the activations from the clean run. The indirect effect is thus measured via the mediated effects of the intervention (Meng et al., 2022).
  • Figure 2: [Left] Counts of instances where the modified features are assigned higher importance. [Right] Probability scores of original and counterfactual answers in the clean and corrupted (STR/GN) runs.
  • Figure 3: The model's probability scores from both the clean and corrupted runs are recorded and subtracted from the patched scores at each token and layer. All activations from the clean run are hooked and subsequently patched in at the target location before the run continues. AP is applied to both outputs, the answer and the explanation, yielding the final causal matrix $C$ from which CaF is measured.
  • Figure 4: Pearson's correlation between plausibility and faithfulness.
  • Figure 5: Causal scores at the token level, CaF(T), across the six models on CoS-E. The cross-lines refer to patching multiple layers with a window of size $10$. Each bar represents the aggregated values of the target features: red, the corrupted token spanned by $S$; blue, the answer choice corresponding to the resulting prediction; and green, all answer choices.
  • ...and 9 more figures