Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

Sepehr Kamahi; Yadollah Yaghoobzadeh

TL;DR

This paper addresses the challenge of evaluating the faithfulness of attribution methods for autoregressive language models by introducing a counterfactual-based protocol that keeps evaluation inputs in-distribution. It pairs a counterfactual editor with a predictor and uses contrastive attributions to measure how many of the top-attributed tokens must be changed to flip the model’s prediction, across multiple datasets and model configurations. The study demonstrates that counterfactual generators yield in-distribution text and produce consistent faithfulness rankings across editors, while traditional replacement strategies can induce out-of-distribution (OOD) inputs and distort evaluations, especially for instruct-tuned models. Overall, the approach provides a principled, distribution-preserving framework for assessing attribution methods and shows that the effectiveness of different feature importance (attribution) methods depends on the task and the model.

Abstract

Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method -- how accurately it explains the inner workings and decision-making of the model -- is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model's output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models. Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.

Paper Structure (20 sections, 13 equations, 6 figures, 8 tables)

Introduction
Related work
Our method
Experimental Setup
Datasets
Models
Editor Models
Predictor Models
Attribution Methods
Gradient Norm
Gradient $\times$ Input
Erasure
KernelSHAP
Integrated Gradients
Results and Discussion
...and 5 more sections

Figures (6)

Figure 1: Prompting techniques used for counterfactual generation in the second phase.
Figure 2: Our process of generating counterfactuals for evaluating attribution methods. The predictor (an LM) generates a label for the given text, and an attribution method specifies the most important tokens. We mask the top n% of them and ask an editor (another LM) to change the label of the input text by filling in the masked tokens. If the attribution method is more faithful, then the required n% should be lower.
Figure 3: Creation of training examples for fine-tuning the counterfactual generator, and one given sample.
Figure 4: The top matrix presents the average correlation of attribution ranks for the fine-tuned predictor. The middle matrix shows the average correlation of attribution ranks when using an off-the-shelf instruct-tuned predictor. The bottom matrix illustrates the difference between the fine-tuned and instruct-tuned models, indicating that when editors are used as the replacement method, the difference in correlation is near zero. In contrast, using other replacement methods (i.e., <unk>, erase, <mask>, att-zero) results in significant inconsistencies between the two predictor types, likely due to the creation of out-of-distribution (OOD) text for the instruct-tuned model.
Figure 5: The difference
...and 1 more figure
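
The procedure described in the Figure 2 caption can be sketched as a short loop: mask the top n% of tokens ranked by an attribution method, let an editor fill the masks to flip the label, and record the smallest n% at which the predictor's label changes. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The names `mask_top_fraction`, `min_fraction_to_flip`, `toy_predictor`, and `toy_editor` are hypothetical, the word-counting predictor and the fixed-word editor stand in for the language models used in the paper, and the fraction grid is arbitrary.

```python
from typing import Callable, List, Optional, Sequence

MASK = "<mask>"  # placeholder token; the actual mask format is an assumption


def mask_top_fraction(tokens: Sequence[str], scores: Sequence[float],
                      fraction: float) -> List[str]:
    """Replace the highest-scoring `fraction` of tokens with MASK."""
    n_mask = max(1, round(fraction * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_mask])
    return [MASK if i in top else tok for i, tok in enumerate(tokens)]


def min_fraction_to_flip(
    tokens: Sequence[str],
    scores: Sequence[float],
    predictor: Callable[[Sequence[str]], str],
    editor: Callable[[Sequence[str], str], List[str]],
    fractions: Sequence[float] = (0.2, 0.4, 0.6, 0.8, 1.0),
) -> Optional[float]:
    """Return the smallest masked fraction at which the edited text flips
    the predictor's label, or None if no fraction in the grid flips it."""
    original_label = predictor(tokens)
    for fraction in fractions:
        masked = mask_top_fraction(tokens, scores, fraction)
        edited = editor(masked, original_label)
        if predictor(edited) != original_label:
            return fraction
    return None


# --- Toy stand-ins for the predictor and editor LMs (hypothetical) ---
POSITIVE = {"great", "fun", "good"}
NEGATIVE = {"terrible", "boring", "bad"}


def toy_predictor(tokens: Sequence[str]) -> str:
    """Label text by counting sentiment words; ties count as negative."""
    balance = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if balance > 0 else "negative"


def toy_editor(masked_tokens: Sequence[str], original_label: str) -> List[str]:
    """Fill every mask with a word that pushes toward the opposite label."""
    fill = "terrible" if original_label == "positive" else "great"
    return [fill if t == MASK else t for t in masked_tokens]


tokens = "the movie was great and fun".split()
# Two hypothetical attribution score vectors over the six tokens.
faithful_scores = [0.0, 0.1, 0.0, 0.9, 0.0, 0.8]    # ranks sentiment words first
unfaithful_scores = [0.9, 0.8, 0.7, 0.0, 0.6, 0.1]  # ranks filler words first

faithful_result = min_fraction_to_flip(tokens, faithful_scores, toy_predictor, toy_editor)
unfaithful_result = min_fraction_to_flip(tokens, unfaithful_scores, toy_predictor, toy_editor)
```

In this toy setting, the score vector that ranks the sentiment-bearing words first needs a smaller masked fraction to flip the label than the one that ranks filler words first, which mirrors the caption's statement that a more faithful attribution method should require a lower n%.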
