Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions

Jingtan Wang; Xiaoqiang Lin; Rui Qiao; Chuan-Sheng Foo; Bryan Kian Hsiang Low

TL;DR

The paper studies the robustness of instance attribution for explaining language model predictions. It introduces the notion of $\beta$-robustness and shows that Shapley-value attributions are more robust to dataset resampling than leave-one-out (LOO) scores. To address the high cost of computing Shapley values, it proposes FreeShap, a fine-tuning-free Shapley approximation based on kernel regression with the empirical neural tangent kernel (NTK), using precomputation and submatrix reuse for scalability. Experiments on SST-2, MR, MRPC, and RTE show that FreeShap closely tracks MC-Shapley and performs better in data removal, data selection, and wrong-label detection, and that it extends to LLMs such as Llama2. The approach contributes practical tools for data-centric AI in NLP together with theoretical guarantees on robustness. The authors acknowledge that it is currently limited to NLP and classification tasks and suggest extension to generation settings as future work.
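For reference, the two instance scores compared here have standard textbook forms. The block below writes them for a generic utility function $U$ (for example, test accuracy after training on a subset); the symbols $U$, $D$, and $S$ are notation assumed for this sketch and may differ from the paper's.

$$
\phi_i^{\text{LOO}} = U(D) - U(D \setminus \{z_i\}), \qquad
\phi_i^{\text{Shap}} = \frac{1}{n} \sum_{k=0}^{n-1} \binom{n-1}{k}^{-1} \sum_{\substack{S \subseteq D \setminus \{z_i\} \\ |S| = k}} \bigl[ U(S \cup \{z_i\}) - U(S) \bigr]
$$

LOO uses only the marginal contribution of $z_i$ to the full dataset of size $n$, whereas the Shapley value averages marginal contributions over subsets of every size $k$.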

Abstract

The increasing complexity of foundation models underscores the need for explainability, particularly for fine-tuning, the most widely used method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes a model prediction to each training example through an instance score. However, the robustness of instance scores, specifically to dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness based on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, but at a higher computational cost. Accordingly, we introduce an efficient fine-tuning-free approximation of the Shapley value (FreeShap) for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and for other data-centric applications such as data removal, data selection, and wrong label detection, and we further scale our approach to large language models (LLMs). Our code is available at https://github.com/JTWang2000/FreeShap.
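The idea of replacing fine-tuning with kernel regression on a precomputed kernel can be illustrated with a minimal sketch. This is not the authors' implementation: it uses an RBF kernel on toy 2-D data as a stand-in for the empirical NTK, a plain Monte Carlo permutation estimator, and hypothetical names (`rbf_kernel`, `utility`, `mc_shapley`) and parameters (`reg`, `num_perms`) chosen for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Stand-in kernel (assumption); FreeShap uses the empirical NTK instead.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def utility(subset, K_train, K_test, Y_train, y_test, reg=1e-3):
    """Test accuracy of kernel ridge regression fit on a subset of the training set."""
    if len(subset) == 0:
        return 0.0  # convention chosen for this sketch
    idx = np.sort(np.asarray(subset))
    # Reuse a submatrix of the kernel computed once for the full training set.
    K_ss = K_train[np.ix_(idx, idx)]
    alpha = np.linalg.solve(K_ss + reg * np.eye(len(idx)), Y_train[idx])
    preds = (K_test[:, idx] @ alpha).argmax(axis=1)
    return float((preds == y_test).mean())

def mc_shapley(n, utility_fn, num_perms=20, seed=0):
    """Monte Carlo Shapley estimate: average marginal gains over random permutations."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        prev = utility_fn([])
        for k in range(n):
            curr = utility_fn(perm[: k + 1])
            values[perm[k]] += curr - prev
            prev = curr
    return values / num_perms

# Toy two-class data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
y_train = np.array([0, 1] * 6)
y_test = np.array([0, 1] * 10)
X_train = rng.normal(size=(12, 2)) + 2.0 * y_train[:, None]
X_test = rng.normal(size=(20, 2)) + 2.0 * y_test[:, None]

K_train = rbf_kernel(X_train, X_train)  # computed once
K_test = rbf_kernel(X_test, X_train)    # computed once
Y_train = np.eye(2)[y_train]            # one-hot regression targets

u = lambda s: utility(s, K_train, K_test, Y_train, y_test)
shap = mc_shapley(len(y_train), u)
print(np.round(shap, 3))
```

Because every subset is scored by solving a small linear system on a slice of the same kernel matrix, no model is retrained per subset; that is the property the sketch is meant to show. Training points with negative estimated scores would be candidates for harmful data under this toy utility.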

Paper Structure (44 sections, 2 theorems, 20 equations, 19 figures, 21 tables, 1 algorithm)

Introduction
Background and Preliminaries
Prompt-based Fine-tuning
LOO and Shapley Value for Instance Attribution
Neural Tangent Kernel
Methodology
Definition of Robustness
Robustness of Instance Attribution
Fine-tuning-free Shapley Value (FreeShap)
Experiments and Results
FreeShap Approximates the Shapley Value Well
The Shapley Value is More Robust than LOO
Applications of Instance Attribution
Related work
Conclusion
...and 29 more sections

Key Result

Theorem 3.4

Let $\delta_k \coloneqq \text{Var}_{D_N \sim \mathcal{P}^{n-1}|z_i}(\Delta_{z_i}^{D_N}(k,D_T)) , \forall k \in \{0, \dots, n-1\}$. Shapley value is $\beta^{\text{Shap}}$-robust and LOO is $\beta^{\text{LOO}}$-robust where

Figures (19)

Figure 1: An example of non-robust instance attribution. The same training example receives different signs of the instance score when it is placed in different datasets sampled from the same task.
Figure 2: Mean and variance for instance scores of 10 examples when computed using LOO or the Shapley value.
Figure 3: Running time comparison. The time for 5k points for G-Shapley and 500/5k points for MC-Shapley are projected.
Figure 4: Data Removal: Test accuracy of models retrained on subsets obtained by iteratively removing 10% of the data, starting from either the highest or the lowest instance scores. Faster degradation is preferable for high-score removals, while improvement or slower degradation is ideal for low-score removals. Overall, the scores from FreeShap are better correlated with test performance.
Figure 5: Wrong Label Detection: The percentage of poisoned data detected when inspecting data from the lowest to the highest instance score. In most cases, FreeShap leads to the earliest identification of incorrectly labeled instances.
...and 14 more figures

Theorems & Definitions (7)

Definition 3.1: Expected marginal contribution
Definition 3.2: Consistently helpful/harmful data point
Definition 3.3: Robustness of instance attribution
Theorem 3.4: Robustness for Shapley value & LOO
Corollary 3.5: Robustness Analysis between the Shapley value and LOO
Remark 3.6: Relative relationship of expectation and variance between Shapley and LOO
proof
