
Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions

Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Chuan-Sheng Foo, Bryan Kian Hsiang Low

TL;DR

The paper tackles robustness in instance attribution for language model explanations. It introduces the notion of $\beta$-robustness and shows that Shapley-value attributions are more robust to dataset resampling than leave-one-out (LOO) scores. Because Shapley values are expensive to compute, it proposes FreeShap, a fine-tuning-free Shapley approximation based on kernel regression with the empirical neural tangent kernel (eNTK), which uses precomputation and submatrix reuse for scalability. Empirical results on SST-2, MR, MRPC, and RTE show that FreeShap closely tracks MC-Shapley and outperforms other attribution methods in data removal, data selection, and wrong-label detection, and that it extends to LLMs such as Llama2. The approach contributes practical tools for data-centric AI in NLP and provides theoretical guarantees on robustness. The authors note that the study is limited to NLP classification tasks and suggest extending it to generation settings as future work.
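The core FreeShap idea, replacing per-subset fine-tuning with kernel regression, can be illustrated with a minimal sketch of the utility $v(S)$ that a Shapley estimator queries. It assumes a precomputed kernel. A plain RBF kernel stands in for the empirical NTK here, because computing the eNTK requires per-example gradients of the language model, which this sketch omits. It also assumes kernel ridge regression with a small ridge term `reg`, one-hot targets, argmax predictions, and zero utility for the empty set. The names `rbf_kernel` and `KernelRegressionUtility` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Stand-in kernel for illustration only; FreeShap itself uses the empirical NTK."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative rounding errors


class KernelRegressionUtility:
    """Test accuracy of kernel ridge regression fit on a subset S of the training set.

    Both kernel matrices are computed once in __init__; each utility call only
    slices the rows/columns indexed by S, so no model is fine-tuned per subset.
    """

    def __init__(self, x_train, y_train, x_test, y_test, n_classes, reg=1e-3):
        self.k_train = rbf_kernel(x_train, x_train)  # (n_train, n_train), computed once
        self.k_test = rbf_kernel(x_test, x_train)    # (n_test, n_train), computed once
        self.targets = np.eye(n_classes)[y_train]    # one-hot regression targets
        self.y_test = np.asarray(y_test)
        self.reg = reg                               # ridge term (hypothetical default)

    def __call__(self, subset) -> float:
        idx = np.array(sorted(subset), dtype=int)
        if idx.size == 0:
            return 0.0  # assumed convention: the empty coalition has zero utility
        k_ss = self.k_train[np.ix_(idx, idx)]        # reuse a submatrix of the Gram matrix
        alpha = np.linalg.solve(k_ss + self.reg * np.eye(idx.size), self.targets[idx])
        preds = (self.k_test[:, idx] @ alpha).argmax(axis=1)
        return float((preds == self.y_test).mean())
```

In this sketch, each utility call costs one linear solve on an $|S| \times |S|$ submatrix of a Gram matrix that is computed only once. This kind of saving is what makes repeated Shapley queries affordable compared with re-fine-tuning a model for every subset.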

Abstract

The increasing complexity of foundation models underscores the need for explainability, particularly for fine-tuning, the most widely used method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes a model prediction to each training example through an instance score. However, the robustness of instance scores, specifically to dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness based on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, albeit at a higher computational cost. Accordingly, we introduce FreeShap, an efficient fine-tuning-free approximation of the Shapley value for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and for other data-centric applications such as data removal, data selection, and wrong label detection, and we further scale it to large language models (LLMs). Our code is available at https://github.com/JTWang2000/FreeShap.
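A generic sketch makes the contrast between LOO and the Shapley value concrete. LOO measures only the marginal contribution of a point to the full dataset, whereas the Shapley value averages its marginal contributions over orderings of the data. The estimator below is the standard permutation-sampling approach used by Monte Carlo Shapley methods, not the FreeShap code. The function names and the `Utility` type alias are illustrative assumptions.

```python
import itertools
import random
from typing import Callable, FrozenSet, List, Optional

# A utility maps a subset of training indices to a score such as test accuracy.
Utility = Callable[[FrozenSet[int]], float]


def leave_one_out(utility: Utility, n: int) -> List[float]:
    """LOO score of point i: v(D) - v(D without i)."""
    full = frozenset(range(n))
    v_full = utility(full)
    return [v_full - utility(full - {i}) for i in range(n)]


def permutation_shapley(utility: Utility, n: int,
                        num_perms: Optional[int] = None, seed: int = 0) -> List[float]:
    """Shapley value as the average marginal contribution over orderings.

    num_perms=None enumerates all n! orderings (exact, only feasible for tiny n);
    otherwise num_perms orderings are sampled uniformly (Monte Carlo estimate).
    """
    if num_perms is None:
        perms = list(itertools.permutations(range(n)))
    else:
        rng = random.Random(seed)
        perms = [rng.sample(range(n), n) for _ in range(num_perms)]
    scores = [0.0] * n
    for perm in perms:
        prefix: FrozenSet[int] = frozenset()
        prev = utility(prefix)
        for i in perm:
            prefix = prefix | {i}          # add point i to the growing coalition
            cur = utility(prefix)
            scores[i] += cur - prev        # marginal contribution of i in this ordering
            prev = cur
    return [s / len(perms) for s in scores]
```

Consider a hypothetical toy utility, `saturating(S) = 1.0 if S else 0.0`, in which any non-empty subset attains full utility. With three points, `leave_one_out` returns `[0.0, 0.0, 0.0]` because removing any single point changes nothing. The exact `permutation_shapley`, by contrast, credits each point with 1/3, and these values sum to $v(D) - v(\emptyset) = 1$.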

Paper Structure (44 sections, 2 theorems, 20 equations, 19 figures, 21 tables, 1 algorithm)

Key Result

Theorem 3.4

Let $\delta_k \coloneqq \text{Var}_{D_N \sim \mathcal{P}^{n-1}|z_i}(\Delta_{z_i}^{D_N}(k,D_T)), \forall k \in \{0, \dots, n-1\}$. Then the Shapley value is $\beta^{\text{Shap}}$-robust and LOO is $\beta^{\text{LOO}}$-robust, where

Figures (19)

  • Figure 1: An example of non-robust instance attribution. The same training example receives different signs of the instance score when it is placed in different datasets sampled from the same task.
  • Figure 2: Mean and variance for instance scores of 10 examples when computed using LOO or the Shapley value.
  • Figure 3: Running time comparison. The running times of G-Shapley on 5k points and of MC-Shapley on 500 and 5k points are projected.
  • Figure 4: Data Removal: Test accuracy of models retrained on subsets obtained by iteratively removing 10% of the data, starting from either the highest or the lowest instance scores. Faster degradation is preferable when removing high-score data, while improvement or slower degradation is ideal when removing low-score data. Overall, the scores from FreeShap are better correlated with test performance.
  • Figure 5: Wrong Label Detection: Percentage of poisoned (mislabeled) data detected when inspecting data from the lowest to the highest instance score. In most cases, FreeShap identifies incorrectly labeled instances earliest.
  • ...and 14 more figures
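The evaluation protocols described in Figures 1, 4, and 5 can be sketched with a few helpers, assuming instance scores are already available as arrays. The function names are hypothetical. The sign-agreement measure is an illustrative proxy for the sign flips in Figure 1, not the paper's Definition 3.3. Retraining a model on each kept subset (Figure 4) is left to the surrounding pipeline.

```python
import numpy as np


def sign_agreement(score_samples):
    """Per-point fraction of resamples whose score sign matches the sign of the mean score.

    score_samples has shape (num_resamples, num_points): column j holds the instance
    scores of training point j computed in different resampled datasets.
    """
    s = np.asarray(score_samples, dtype=float)
    reference = np.sign(s.mean(axis=0))
    return (np.sign(s) == reference).mean(axis=0)


def removal_schedule(scores, step=0.1, highest_first=True):
    """Indices kept after removing the top (or bottom) 10%, 20%, ... of points by score."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores if highest_first else scores, kind="stable")
    n = len(scores)
    num_steps = int(round(1 / step))
    # Step k drops the first round(k * step * n) points in removal order.
    return [np.sort(order[int(round(k * step * n)):]) for k in range(1, num_steps)]


def detection_curve(scores, is_mislabeled, checkpoints):
    """Fraction of mislabeled points found after inspecting the c lowest-scored points."""
    order = np.argsort(np.asarray(scores, dtype=float), kind="stable")  # lowest score first
    found = np.cumsum(np.asarray(is_mislabeled, dtype=bool)[order])
    total = int(found[-1])
    if total == 0:
        raise ValueError("no mislabeled points to detect")
    return [float(found[c - 1]) / total if c > 0 else 0.0 for c in checkpoints]
```

A better attribution method should push `detection_curve` toward 1.0 at smaller checkpoints. It should also make accuracy drop faster along `removal_schedule(..., highest_first=True)` than along the low-score schedule.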

Theorems & Definitions (7)

  • Definition 3.1: Expected marginal contribution
  • Definition 3.2: Consistently helpful/harmful data point
  • Definition 3.3: Robustness of instance attribution
  • Theorem 3.4: Robustness for Shapley value & LOO
  • Corollary 3.5: Robustness Analysis between the Shapley value and LOO
  • Remark 3.6: Relative relationship of expectation and variance between Shapley and LOO
  • Proof
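The quantities in Theorem 3.4 can be related to the Shapley value and LOO through a short derivation. It rests on an assumed reading of Definition 3.1: that $\Delta_{z_i}^{D_N}(k, D_T)$ is the expected marginal contribution of $z_i$ to a uniformly random size-$k$ subset of the other points in $D_N$, scored by a utility $v$ on the test set $D_T$. The symbols $v$, $\phi$, $\mu$, and $\sigma$ are notation introduced here. The result is a pair of generic averaging and Cantelli bounds, not the paper's exact expressions for $\beta^{\text{Shap}}$ and $\beta^{\text{LOO}}$.

```latex
\Delta_{z_i}^{D_N}(k, D_T) \coloneqq \mathbb{E}_{S \subseteq D_N \setminus \{z_i\},\, |S| = k}
  \left[ v(S \cup \{z_i\}; D_T) - v(S; D_T) \right]

\phi^{\text{Shap}}_{z_i} = \frac{1}{n} \sum_{k=0}^{n-1} \Delta_{z_i}^{D_N}(k, D_T),
\qquad
\phi^{\text{LOO}}_{z_i} = \Delta_{z_i}^{D_N}(n-1, D_T)

\text{Var}\big(\phi^{\text{LOO}}_{z_i}\big) = \delta_{n-1},
\qquad
\text{Var}\big(\phi^{\text{Shap}}_{z_i}\big)
  \le \left( \frac{1}{n} \sum_{k=0}^{n-1} \sqrt{\delta_k} \right)^2
  \le \frac{1}{n} \sum_{k=0}^{n-1} \delta_k

\Pr\left[ \phi_{z_i} \le 0 \right] \le \frac{\sigma^2}{\sigma^2 + \mu^2}
\quad \text{if } \mu = \mathbb{E}[\phi_{z_i}] > 0,\ \sigma^2 = \text{Var}(\phi_{z_i})
```

Because the Shapley value averages over all cardinalities $k$, its variance is at most the average of the $\delta_k$, whereas LOO depends entirely on $\delta_{n-1}$. By Cantelli's inequality, a smaller variance relative to the squared mean means a smaller chance that a helpful point receives a non-positive score.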