ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models

Zhixue Zhao; Boxuan Shan

TL;DR

ReAGent tackles the challenge of faithful feature attribution (FA) for decoder-only generative LMs by introducing a model-agnostic, gradient-free method that recursively updates token importances by replacing context tokens with RoBERTa predictions. At each update it measures the impact on the next-token distribution, $\Delta p_t = p^{(o)}_t - p^{(r)}_t$, where $p^{(o)}_t$ is computed from the original input and $p^{(r)}_t$ from the input with replaced tokens, and aggregates these signals into a per-token importance distribution without accessing model weights; $\Delta p_t$ is thus the core signal driving the attribution updates throughout the recursion. Across three generation tasks and six decoders from two families, ReAGent consistently yields higher faithfulness (Soft-NS/Soft-NC) than seven established FAs, especially on LongRA token-level analyses and OPT models, while remaining robust to hyper-parameter choices. The approach enables faithful explanations for black-box generative LMs and avoids the computational overhead of gradients or fine-tuning, with practical implications for interpretability and responsible deployment of large language models. It therefore offers a versatile, scalable way to understand how input tokens influence generation in decoder-only architectures, with potential extensions to other modalities and generation tasks.
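
As a worked illustration of the signal $\Delta p_t$, the lines below use hypothetical numbers (not taken from the paper) and assume $p_t$ is read as the probability the model assigns to the predicted next token. Replacing an important context token should produce a large drop, while replacing an unimportant one should barely move the prediction.

```latex
% Hypothetical values, chosen only to illustrate the sign and size of \Delta p_t.
\begin{align*}
\text{important token replaced:}\quad
  & p^{(o)}_t = 0.90,\; p^{(r)}_t = 0.20
  \;\Rightarrow\; \Delta p_t = 0.90 - 0.20 = 0.70 \\
\text{unimportant token replaced:}\quad
  & p^{(o)}_t = 0.90,\; p^{(r)}_t = 0.85
  \;\Rightarrow\; \Delta p_t = 0.90 - 0.85 = 0.05
\end{align*}
```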

Abstract

Feature attribution methods (FAs), such as gradients and attention, are widely used to derive the importance of all input features to a model's predictions. Existing work in natural language processing has mostly focused on developing and testing FAs for encoder-only language models (LMs) in classification tasks. However, it is unknown whether these FAs remain faithful for decoder-only models on text generation, given the inherent differences in both model architecture and task setting. Moreover, previous work has demonstrated that there is no "one-wins-all" FA across models and tasks. This makes selecting an FA computationally expensive for large LMs, since deriving input importance often requires multiple forward and backward passes, including gradient computations, that might be prohibitive even with access to large compute. To address these issues, we present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent). Our method updates the token importance distribution recursively. For each update, we compute the difference in the probability distribution over the vocabulary for predicting the next token between using the original input and using a modified version in which part of the input is replaced with RoBERTa predictions. Our intuition is that replacing an important token in the context should result in a larger change in the model's confidence in predicting the token than replacing an unimportant token. Our method can be applied to any generative LM without accessing internal model weights and without additional training or fine-tuning, which most other FAs require. We extensively compare the faithfulness of ReAGent with seven popular FAs across six decoder-only LMs of various sizes. The results show that our method consistently provides more faithful token importance distributions.
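
The recursive update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_target_prob` is a hypothetical stand-in for a black-box generative LM, `toy_replacement` is a hypothetical stand-in for RoBERTa predictions (it simply draws a random token), the loop runs for a fixed number of steps rather than using the paper's stopping condition, and the update rule (add $\Delta p$ to the replaced positions, then normalize with a softmax) is an assumed simplification.

```python
import math
import random


def toy_target_prob(tokens):
    """Hypothetical black-box LM: probability of the predicted next token.

    The toy model is confident only when the cue word "Tennessee" is present.
    """
    return 0.9 if "Tennessee" in tokens else 0.1


def toy_replacement(rng, vocab):
    """Hypothetical stand-in for a RoBERTa prediction: a random vocabulary token."""
    return rng.choice(vocab)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attribute(tokens, prob_fn, vocab, steps=200, replace_ratio=0.3, seed=0):
    """Return a per-token importance distribution using only forward calls to prob_fn."""
    rng = random.Random(seed)
    n = len(tokens)
    scores = [0.0] * n
    p_orig = prob_fn(tokens)  # confidence with the original input
    k = max(1, round(replace_ratio * n))  # number of tokens replaced per step
    for _ in range(steps):  # assumption: fixed step count, no stopping condition
        replaced = set(rng.sample(range(n), k))
        perturbed = [
            toy_replacement(rng, vocab) if i in replaced else tok
            for i, tok in enumerate(tokens)
        ]
        delta_p = p_orig - prob_fn(perturbed)  # change in confidence after replacement
        for i in replaced:
            scores[i] += delta_p  # assumed update rule: credit the replaced positions
    return softmax(scores)


tokens = ["I", "arrived", "in", "Tennessee", "and", "slept"]
vocab = ["the", "a", "of", "city"]
importance = attribute(tokens, toy_target_prob, vocab)
for tok, score in zip(tokens, importance):
    print(f"{tok:>10s}  {score:.4f}")
```

Because the sketch only calls `prob_fn` on token lists, it needs no gradients or model weights, which mirrors the model-agnostic property claimed above. In the toy run, the token the model depends on accumulates the largest score, since confidence drops only when that token is replaced.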

Paper Structure (29 sections, 10 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Post-hoc FAs
FAs for Generative Models
Preliminaries
Generative Language Modeling
Input Importance for Generative LMs
Recursive Attribution Generator (ReAGent)
Computing Importance Scores
Step 3: Context tokens to be replaced
Step 4: Replacement tokens
Steps 5 & 6: Updating importance scores
Step 2: Stopping Condition
Experimental Setup
Datasets
...and 14 more sections

Figures (5)

Figure 1: Input importance distributions for a generative task (top) and a classification task (bottom) using a toy FA.
Figure 2: Token-level faithfulness on LongRA. Values close to zero indicate that a method's faithfulness is on par with the random baseline.
Figure 3: Sequence-level faithfulness, i.e. Soft-NS and Soft-NC, on two datasets: WikiBio and TellMeWhy. Values close to zero indicate that a method's faithfulness is on par with the random baseline.
Figure 4: Importance distribution over the input: "As soon as I arrived in Tennessee, I checked into my hotel, and watched a movie before falling asleep. (I had a great call with my husband, although I wish it were longer). I was staying in my favorite city, ". The sentence in () is the distractor. The model predicts "Nashville" regardless of whether the input includes the distractor or not.
Figure 5: Sufficiency and Comprehensiveness scores at different update steps.
