Data Shapley in One Training Run

Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

TL;DR

This paper replaces computationally prohibitive retraining-based Data Shapley with In-Run Data Shapley, a model-specific attribution method that tracks data value across a single training run. It defines a per-iteration utility $U^{(t)}$ and applies first- and second-order Taylor expansions to obtain closed-form Shapley contributions, which are computed with minimal overhead through ghost dot-product and ghost gradient-Hessian techniques. Experiments on GPT-2 and the Pile dataset show that first-order attribution runs at close to regular training speed and that the approximations closely track Monte Carlo estimates of In-Run Data Shapley, enabling practical data curation and copyright-aware analysis at foundation-model scale. Case studies reveal data quality issues, stage-dependent data importance, and implications for data tracing and royalty considerations in generative AI.

Abstract

Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

Paper Structure (40 sections, 5 theorems, 32 equations, 15 figures, 5 tables)

Key Result

Theorem 2

For any two utility functions $U_1, U_2$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, we have $\phi_{z}\left(\alpha_{1} U_{1}+\alpha_{2} U_{2}\right)=\alpha_{1} \phi_{z}\left(U_{1}\right)+\alpha_{2} \phi_{z}\left(U_{2}\right)$.
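The linearity property can be checked numerically on a small game. The sketch below is only an illustration: the three players, the toy utilities `u1` and `u2`, and the helper `shapley_values` are hypothetical and do not come from the paper. It computes exact Shapley values by enumerating every subset, which is feasible only for a handful of players.

```python
from itertools import combinations
from math import comb


def shapley_values(players, utility):
    """Exact Shapley values by brute-force subset enumeration (tiny n only)."""
    n = len(players)
    values = {}
    for z in players:
        others = [p for p in players if p != z]
        total = 0.0
        for k in range(n):
            # Weight of a marginal contribution to a coalition of size k.
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {z}) - utility(s))
        values[z] = total
    return values


# Hypothetical toy game with three "data points".
players = ["a", "b", "c"]
size = {"a": 1.0, "b": 2.0, "c": 4.0}


def u1(s):
    # Toy utility: squared total size of the coalition.
    return sum(size[p] for p in s) ** 2


def u2(s):
    # Toy utility: 1 if the coalition contains both "a" and "c", else 0.
    return 1.0 if {"a", "c"} <= s else 0.0


alpha1, alpha2 = 0.5, -3.0


def combined(s):
    return alpha1 * u1(s) + alpha2 * u2(s)


phi1 = shapley_values(players, u1)
phi2 = shapley_values(players, u2)
phi_combined = shapley_values(players, combined)

# Linearity: the value under the combined utility equals the same
# linear combination of the individual values, for every player.
for z in players:
    expected = alpha1 * phi1[z] + alpha2 * phi2[z]
    assert abs(phi_combined[z] - expected) < 1e-9
```

This same property is what allows a Shapley value defined on a sum of per-iteration utilities to be computed one iteration at a time and then added up.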

Figures (15)

  • Figure 1: Comparison between the Monte Carlo-estimated In-Run Data Shapley and First/Second-order In-Run Data Shapley.
  • Figure 2: Test loss comparison between the original training run and the model trained on the cleaned subset according to different data attribution techniques.
  • Figure 3: Left: Domain value composition for a corpus of math text. Right: The math corpus we use as the validation data for attribution, and examples of high- and low-valued training corpus for it.
  • Figure 4: The core idea and algorithm overview of In-Run Data Shapley. Rather than evaluating data contribution across the entire training process (top), we decompose it into individual gradient update steps (bottom). For each iteration $t$, we compute the Shapley value $\phi_z(U^{(t)})$ with respect to a "local utility function" $U^{(t)}$ that measures how the batch $\mathcal{B}_t$ contributes to reducing validation loss. A data point's final contribution score is the sum of its Shapley values across all iterations where it appears: $\phi_z = \sum_{t:z\in \mathcal{B}_t} \phi_z(U^{(t)})$. This decomposition approach maintains the Shapley value properties through the linearity axiom while making computation tractable.
  • Figure 5: Comparison between Retraining-based and In-Run Data Shapley. Retraining-based Data Shapley requires training a model from scratch on all possible subsets of the full training set, which is computationally inefficient and raises concerns about interpretability and stability. In contrast, In-Run Data Shapley acts as a "contribution accountant", efficiently tracking and attributing data value scores to each training example across gradient update steps during a single training run.
  • ...and 10 more figures
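The accumulation described in the Figure 4 caption, $\phi_z = \sum_{t:z\in \mathcal{B}_t} \phi_z(U^{(t)})$, can be sketched on a toy problem. The code below is a minimal illustration under several assumptions that are not stated in this summary: the per-iteration utility is taken to be the reduction in validation loss after one summed-gradient SGD step, and its first-order Taylor approximation is taken to give each batch member the score $\eta_t \, \nabla \ell(w_t, z^{\text{val}}) \cdot \nabla \ell(w_t, z)$. The sign convention (positive means the point reduced validation loss) is chosen here for readability and may differ from the paper's. Per-sample gradients are computed explicitly rather than with the ghost dot-product technique, and the linear-regression data and all function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(w, x, y):
    # Squared-error loss of a linear model on one example.
    return 0.5 * (dot(w, x) - y) ** 2


def grad(w, x, y):
    # Gradient of the squared-error loss with respect to w.
    r = dot(w, x) - y
    return [r * xi for xi in x]


def train_with_first_order_scores(train, val, lr=0.01, batch_size=2,
                                  epochs=20, seed=0):
    """SGD that also accumulates an assumed first-order per-step score."""
    rng = random.Random(seed)
    x_val, y_val = val
    w = [0.0] * len(x_val)
    scores = [0.0] * len(train)
    indices = list(range(len(train)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # All gradients are evaluated at the current weights w_t.
            g_val = grad(w, x_val, y_val)
            grads = {i: grad(w, *train[i]) for i in batch}
            for i in batch:
                # Assumed first-order contribution for this iteration;
                # it is added only when example i appears in the batch.
                scores[i] += lr * dot(g_val, grads[i])
            for i in batch:
                # Summed-gradient update: w_{t+1} = w_t - lr * sum of grads.
                w = [wj - lr * gj for wj, gj in zip(w, grads[i])]
    return w, scores


# Hypothetical data: the first three points agree with w* = [2, -1];
# the last point is deliberately mislabeled.
train = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
    ([1.0, 0.0], -2.0),
]
val = ([1.0, 0.5], 1.5)

w, scores = train_with_first_order_scores(train, val)
for i, s in enumerate(scores):
    print(f"example {i}: accumulated score {s:+.4f}")
```

In this toy run the mislabeled example accumulates a negative score, while the correctly labeled example with the same features accumulates a positive one. Because the assumed first-order utility is additive over batch members, the scores sum to the total validation-loss reduction up to terms that are second order in the learning rate.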

Theorems & Definitions (15)

  • Definition 1: Shapley value (Shapley, 1953)
  • Theorem 2: Linearity of the Shapley value (Shapley, 1953)
  • Remark 1: Multiple validation points
  • Theorem 3
  • Theorem 4
  • Remark 2
  • Remark 3
  • Remark 4: In-run Data Shapley is a model-specific data attribution technique.
  • Theorem 5: Restatement of Theorem \ref{thm:firstorder}
  • proof
  • ...and 5 more