What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing

TL;DR

This work tackles the challenge of crediting data providers for LLM training by scaling influence-function-based data valuation. It introduces LoGra, a memory- and compute-efficient gradient projection method that uses Kronecker-structured projections to reduce the cost of computing per-sample gradients and inverse-Hessian-vector products (iHVPs), and LogIX, a software package that turns existing training code into data valuation code. The authors provide a theoretical motivation for gradient projection within influence functions and report up to 6,500x higher throughput and 5x lower GPU memory usage on Llama3-8B-Instruct with a 1B-token dataset, with accuracy competitive with more expensive baselines. Together, these contributions make data valuation practical for large language models and lay groundwork for data crediting and compensation mechanisms.

Abstract

Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation for gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and a 1B-token dataset.
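
The idea of exploiting gradient structure in backpropagation with a Kronecker-structured projection can be illustrated on a single linear layer. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the layer's weight gradient is the sum over tokens of the outer product between the backpropagated output gradient and the layer input, and that the projection factorizes into an input-side and an output-side matrix. All names (`p_in`, `p_out`, the function names, and the dimensions) are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_full_gradient(x, dy, p_in, p_out):
    """Baseline: materialize the full weight gradient, then project it."""
    grad = dy.T @ x                 # (n_out, n_in), sum over tokens of outer(dy_t, x_t)
    return p_out @ grad @ p_in.T    # (k_out, k_in)

def project_factors_first(x, dy, p_in, p_out):
    """Project inputs and output gradients first; the full gradient is never formed."""
    x_low = x @ p_in.T              # (T, k_in)
    dy_low = dy @ p_out.T           # (T, k_out)
    return dy_low.T @ x_low         # (k_out, k_in)

# Toy sizes (assumed): T tokens, layer of shape n_out x n_in, projected to k_out x k_in.
rng = np.random.default_rng(0)
T, n_in, n_out, k_in, k_out = 5, 32, 24, 4, 3
x = rng.standard_normal((T, n_in))        # forward inputs to the layer
dy = rng.standard_normal((T, n_out))      # gradients w.r.t. the layer outputs
p_in = rng.standard_normal((k_in, n_in))  # random projections, for illustration only
p_out = rng.standard_normal((k_out, n_out))

slow = project_full_gradient(x, dy, p_in, p_out)
fast = project_factors_first(x, dy, p_in, p_out)

# Same result written as one Kronecker-structured matrix acting on the
# column-major vectorized gradient: vec(A X B) = (B^T kron A) vec(X).
via_kron = np.kron(p_in, p_out) @ (dy.T @ x).flatten(order="F")

print(np.allclose(slow, fast))
print(np.allclose(via_kron, slow.flatten(order="F")))
```

In this sketch the factored version only touches matrices of size `T x k_in` and `T x k_out`, whereas the explicit Kronecker matrix has `k_in * k_out` rows and `n_in * n_out` columns, which is what makes the naive projection expensive at LLM scale.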

Paper Structure (45 sections, 1 theorem, 10 equations, 20 figures, 2 tables)

This paper contains 45 sections, 1 theorem, 10 equations, 20 figures, 2 tables.

Key Result

Lemma 1

Let $\{e_1,\cdots,e_n\}$ and $\{\lambda_1,\cdots,\lambda_n\}$ be eigenvectors and eigenvalues of the Hessian $H$. Expressing $g_{tr/te} =\sum_ic_{tr/te,i}\cdot(\sqrt{\lambda_i}e_i)$, the following holds under Assumption eq:assumption1:

Figures (20)

  • Figure 1: Data valuation system architecture. (Left Bottom) We first extract the Hessian and gradients for all training data using efficient gradient projection LoGra and store them in a database. (Left Top) At test time, we similarly extract gradients and query the database. (Right) The database returns similarity scores with respect to training examples that can be used for data valuation/attribution.
  • Figure 2: LoGra.
  • Figure 3: Code Example of LogIX.
  • Figure 4: Quantitative accuracy evaluation of data valuation algorithms. We excluded TRAK in the WikiText experiments due to lack of a public implementation for language modeling tasks.
  • Figure 5: Qualitative accuracy of data valuations with LoGra. Important keywords in each example are manually highlighted for improved readability. More examples can be found in Appendix \ref{sec:qualitative}.
  • ...and 15 more figures
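
The store-then-query workflow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the LogIX API: it assumes the projected per-example gradients are already available as rows of a matrix, approximates the Hessian by the averaged outer product of those gradients plus a damping term (a common choice, not confirmed by the text above), and scores each training example by the inner product between the test gradient and the preconditioned training gradient. The names `build_index`, `query`, and `damping` are hypothetical.

```python
import numpy as np

def build_index(train_grads, damping=1e-3):
    """Precompute H^{-1} g_tr for every stored training gradient.

    train_grads: (N, k) array of projected per-example gradients.
    The Hessian is approximated by the averaged gradient outer product
    (an assumption made for this sketch), with damping for stability.
    """
    n, k = train_grads.shape
    hessian = train_grads.T @ train_grads / n
    damped = hessian + damping * np.eye(k)
    # Solve a linear system instead of forming the explicit inverse.
    return np.linalg.solve(damped, train_grads.T).T   # (N, k)

def query(index, test_grad):
    """Return one influence-style score per training example."""
    return index @ test_grad                          # (N,)

# Toy data standing in for the gradient database (sizes are arbitrary).
rng = np.random.default_rng(0)
train_grads = rng.standard_normal((100, 8))
test_grad = rng.standard_normal(8)

index = build_index(train_grads)
scores = query(index, test_grad)
top5 = np.argsort(-scores)[:5]   # indices of the highest-scoring training examples
print(top5)
```

Because the index is built once, each test-time query reduces to a single matrix-vector product over the stored low-dimensional gradients.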

Theorems & Definitions (1)

  • Lemma 1