Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration

Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng

TL;DR

A novel TDA method called Debias and Denoise Attribution (DDA) is introduced, which enhances influence functions by addressing fitting errors and exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.

Abstract

The black-box nature of large language models (LLMs) poses challenges in interpreting results, impacting issues such as data intellectual property protection and hallucination tracing. Training data attribution (TDA) methods are considered effective solutions to address these challenges. Most recent TDA methods rely on influence functions, assuming the model achieves minimized empirical risk. However, achieving this criterion is difficult, and sourcing accuracy can be compromised by fitting errors during model training. In this paper, we introduce a novel TDA method called Debias and Denoise Attribution (DDA), which enhances influence functions by addressing fitting errors. Specifically, the debias strategy seeks to improve the performance of influence functions by eliminating the knowledge bias present in the base model before fine-tuning, while the denoise strategy aims to reduce discrepancies in influence scores arising from varying degrees of fitting during the training process through smoothing techniques. Experimental results demonstrate that our method significantly outperforms existing approaches, achieving an average AUC of 91.64%. Moreover, DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
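
The two strategies named in the abstract can be illustrated with a toy sketch. This is not the paper's implementation, and the exact DDA formulas are not given in this section. The sketch assumes a linear model with squared loss, replaces the influence function with a plain gradient dot product (inverse Hessian taken as the identity), models "debias" as subtracting `beta` times the base-model score, and models "denoise" as averaging over checkpoints. The names `raw_influence` and `dda_like_scores` are hypothetical.

```python
import numpy as np

def grad_sq_loss(theta, x, y):
    """Gradient of 0.5 * (theta @ x - y)**2 with respect to theta."""
    return (theta @ x - y) * x

def raw_influence(theta, train_xy, test_xy):
    """Score each training example against one test example.

    ASSUMPTION: a gradient dot product stands in for an influence
    function, i.e. the inverse Hessian is replaced by the identity.
    """
    g_test = grad_sq_loss(theta, *test_xy)
    return np.array([grad_sq_loss(theta, x, y) @ g_test for x, y in train_xy])

def dda_like_scores(base_theta, checkpoints, train_xy, test_xy, beta=1.0):
    """Hypothetical debias-then-denoise combination (illustrative only).

    Debias:  subtract beta times the score under the base model, so that
             what the model knew before fine-tuning is discounted.
    Denoise: average the debiased scores over fine-tuning checkpoints to
             smooth out differences in how well each checkpoint has fit.
    """
    base = raw_influence(base_theta, train_xy, test_xy)
    debiased = [raw_influence(t, train_xy, test_xy) - beta * base
                for t in checkpoints]
    return np.mean(debiased, axis=0)

# Toy data: two training examples, one test example, two checkpoints.
train_xy = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 1.0)]
test_xy = (np.array([1.0, 0.0]), 1.0)
base_theta = np.zeros(2)
checkpoints = [np.array([0.5, 0.5]), np.array([0.8, 0.8])]

print(dda_like_scores(base_theta, checkpoints, train_xy, test_xy, beta=1.0))
```

Setting `beta=0.0` turns the debias step off and leaves only the checkpoint average, which makes it easy to compare the two effects separately on toy data.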

Paper Structure

This paper contains 28 sections, 24 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The TDA results on England-China hallucination data across LLMs with varying parameter scales, evaluated using DDA. Specifically, we perform TDA on three different parameter configurations of the Qwen2 model, using the same dataset, training hyperparameters, and training framework.
  • Figure 2: The TDA results of DDA at various values of the debias coefficient $\beta$. We use the epoch 1 checkpoint of LLaMA2-7B-Chat trained on England-China hallucination data. Our findings indicate that as the debias coefficient increases, the TDA capability of DDA gradually stabilizes.