Do Influence Functions Work on Large Language Models?

Zhe Li, Wei Zhao, Yige Li, Jun Sun

TL;DR

This work examines whether influence functions can meaningfully attribute LLM outputs to individual training examples. Through a systematic empirical study across harmful-data identification, class attribution, and backdoor detection, the authors show that influence functions generally underperform representation-based approaches such as RepSim. They identify three causes: approximation errors when computing the inverse Hessian-vector product (iHVP) for large models, uncertain convergence during fine-tuning, and a mismatch between parameter changes and actual behavioral changes in LLMs. The findings call earlier reported successes into question and argue for alternative, more reliable data-attribution methods for understanding and improving LLM behavior in practice.

Abstract

Influence functions are important for quantifying the impact of individual training data points on a model's predictions. Although extensive research has been conducted on influence functions in traditional machine learning models, their application to large language models (LLMs) has been limited. In this work, we conduct a systematic study to address a key question: do influence functions work on LLMs? Specifically, we evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings. Our further investigation reveals that their poor performance can be attributed to: (1) inevitable approximation errors when estimating the iHVP component due to the scale of LLMs, (2) uncertain convergence during fine-tuning, and, more fundamentally, (3) the definition itself, as changes in model parameters do not necessarily correlate with changes in LLM behavior. Thus, our study suggests the need for alternative approaches for identifying influential samples.

Paper Structure

This paper contains 17 sections, 1 theorem, 12 equations, 4 figures, and 10 tables.

Key Result

Theorem 1

Let $\mathbf{H}\in\mathbb{R}^{n\times n}$ be the Hessian matrix of the model, and let $\lambda>0$ be the damping coefficient. When the rank of $\mathbf{H}$ satisfies $\operatorname{rank}(\mathbf{H})\ll n$, the inverse $(\mathbf{H}+\lambda\mathbf{I})^{-1}$ is close to $\mathbf{I}/\lambda$ and the approximation error is a
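The low-rank claim in Theorem 1 can be checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes a random symmetric positive semi-definite matrix as a stand-in for the Hessian, and it measures closeness by the relative Frobenius norm, which is an illustrative choice and not necessarily the error measure used in the theorem. The helper names `low_rank_psd` and `relative_error` are hypothetical. The sizes and damping value mirror the $n=128,512,2048$ and $\lambda=0.1$ used for Figure 3.

```python
import numpy as np


def low_rank_psd(n, rank, rng):
    """Random symmetric PSD matrix of rank at most `rank` (assumed stand-in for H)."""
    a = rng.standard_normal((n, rank))
    return a @ a.T / n


def relative_error(n, rank, lam, seed=0):
    """Relative Frobenius distance between (H + lam*I)^{-1} and I/lam."""
    rng = np.random.default_rng(seed)
    h = low_rank_psd(n, rank, rng)
    damped_inv = np.linalg.inv(h + lam * np.eye(n))
    approx = np.eye(n) / lam
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(damped_inv - approx) / np.linalg.norm(approx)


# Fixed rank, growing dimension: rank(H) << n holds more strongly as n grows.
errors = {n: relative_error(n, rank=4, lam=0.1) for n in (128, 512, 2048)}
for n, err in errors.items():
    print(f"n={n:5d}  relative error={err:.4f}")
```

Under these assumptions the relative error is bounded by $\sqrt{\operatorname{rank}(\mathbf{H})/n}$, so it shrinks as $n$ grows with the rank held fixed. When the damped inverse is nearly $\mathbf{I}/\lambda$, an iHVP computed with it is close to a rescaled gradient and carries little curvature information.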

Figures (4)

  • Figure 1: An example of the most influential training data identified by each method for a given validation example. Important keywords are manually highlighted for clarity.
  • Figure 2: Performance comparison of Llama2-7b and Mistral-7b fine-tuned using the full training dataset and influential subsets selected by different methods. Higher accuracy means better performance on the validation set.
  • Figure 3: Simulated $\mathbf{H}\in\mathbb{R}^{n\times n}$ and $\mathbf{H}+\lambda\mathbf{I}$ with $n=128,512,2048$ and $\lambda=0.1$. All the matrices are normalized for better visualization.
  • Figure 4: Top: Changes of accuracy of the Hessian-based (DataInf) and Hessian-free methods with model convergence during fine-tuning on different datasets. Bottom: Changes in parameters ($\Delta\theta$) during fine-tuning Llama2-7b on different datasets.

Theorems & Definitions (1)

  • Theorem 1