Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions
Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Chuan-Sheng Foo, Bryan Kian Hsiang Low
TL;DR
The paper tackles robustness in instance attribution for language model explanations, introducing the notion of $eta$-robustness and showing that Shapley-value attributions are more robust to dataset resampling than leave-one-out scores. To address the high cost of Shapley computation, it proposes FreeShap, a fine-tuning-free Shapley approximation based on empirical NTK kernel regression, with precomputation and submatrix reuse for scalability. Empirical results on SST-2, MR, MRPC, and RTE show that FreeShap closely tracks MC-Shapley and outperforms the compared methods in data removal, data selection, and wrong-label detection, and that it extends to LLMs such as Llama2. The approach contributes practical tools for data-centric AI in NLP and provides theoretical guarantees on robustness. The authors acknowledge that it is currently limited to NLP and to classification tasks, and suggest extensions to generation settings as future work.
Abstract
The increasing complexity of foundational models underscores the necessity for explainability, particularly for fine-tuning, the most widely used training method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes the model prediction to each training example by an instance score. However, the robustness of instance scores, specifically towards dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, but at a higher computational cost. Accordingly, we introduce an efficient fine-tuning-free approximation of the Shapley value (FreeShap) for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and other data-centric applications such as data removal, data selection, and wrong-label detection, and further scale our approach to large language models (LLMs). Our code is available at https://github.com/JTWang2000/FreeShap.
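
The general idea summarized above (score training examples by Shapley value, with a kernel regression standing in for repeated fine-tuning) can be illustrated with a minimal sketch. This is not the authors' implementation: an RBF kernel is used as a stand-in for the empirical NTK, the data are synthetic, the utility is assumed to be test accuracy with the empty set scored as 0, and all function names (`rbf_kernel`, `utility`, `mc_shapley`, `leave_one_out`) are hypothetical. The point it shows is that the kernel matrix is computed once and every subset evaluation only slices a submatrix of it.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Assumption: RBF kernel as a simple stand-in for the empirical NTK.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def utility(idx, K_tt, K_st, y_train, y_test, reg=1e-3):
    """Test accuracy of kernel ridge regression fit on training subset idx."""
    if len(idx) == 0:
        return 0.0  # assumed utility of the empty training set
    idx = np.asarray(idx)
    K_sub = K_tt[np.ix_(idx, idx)]  # reuse a submatrix of the precomputed kernel
    alpha = np.linalg.solve(K_sub + reg * np.eye(len(idx)), y_train[idx])
    pred = np.sign(K_st[:, idx] @ alpha)
    return float((pred == y_test).mean())

def mc_shapley(n, util, num_perms=50, seed=0):
    """Monte Carlo Shapley: average marginal contributions over random permutations."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        prev = util([])
        for k, i in enumerate(perm):
            cur = util(perm[: k + 1])
            phi[i] += cur - prev
            prev = cur
    return phi / num_perms

def leave_one_out(n, util):
    """Leave-one-out score: utility drop when one example is removed from the full set."""
    everyone = np.arange(n)
    full = util(everyone)
    return np.array([full - util(np.delete(everyone, i)) for i in range(n)])

# Toy binary task with labels in {-1, +1}; three training labels are flipped.
data_rng = np.random.default_rng(0)
X_train = data_rng.normal(size=(20, 2))
X_test = data_rng.normal(size=(30, 2))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
y_test = np.where(X_test[:, 0] > 0, 1.0, -1.0)
y_train[:3] *= -1

# Precompute the kernel blocks once; no model is retrained per subset.
K_tt = rbf_kernel(X_train, X_train)
K_st = rbf_kernel(X_test, X_train)
util = lambda idx: utility(idx, K_tt, K_st, y_train, y_test)

shap_scores = mc_shapley(len(X_train), util)
loo_scores = leave_one_out(len(X_train), util)
print("Shapley:", np.round(shap_scores, 3))
print("LOO:    ", np.round(loo_scores, 3))
```

In this sketch the sign of each score is read as "helpful" (positive) or "harmful" (negative). The leave-one-out scores depend on a single subset comparison per example, whereas the Shapley estimate averages marginal contributions across many subsets, which is the intuition behind the robustness comparison in the abstract.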
