Data Shapley in One Training Run
Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
TL;DR
This paper replaces computationally prohibitive retraining-based Data Shapley with In-Run Data Shapley, a model-specific attribution method that tracks each data point's value over a single training run. It defines a per-iteration utility U^{(t)} and applies first- and second-order Taylor expansions to it, which yields closed-form Shapley contributions that can be computed with minimal overhead using ghost dot-product and ghost gradient-Hessian techniques. Experiments with GPT-2 on the Pile dataset show that first-order attribution runs at close to the speed of regular training and that the resulting scores have substantial fidelity to ground-truth Shapley estimates, which makes data curation and copyright-aware analysis practical at foundation-model scale. Case studies reveal data quality issues and stage-dependent data importance, and carry significant implications for data tracing and royalty considerations in generative AI.
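To make the first-order idea concrete, here is a minimal sketch for a single SGD step on a toy linear regression problem. Several choices are assumptions made for illustration and are not taken from the summary above: the utility of a subset S is taken to be the change in validation loss after updating on S alone, the update sums per-example gradients, and all function names are hypothetical. Under these assumptions the first-order Taylor expansion of the utility is additive over examples, so each example's Shapley value reduces to a scaled dot product between its gradient and the validation gradient. The sketch compares that closed form against brute-force Shapley values on a four-example batch.

```python
import itertools
import math
import numpy as np

def per_example_grads(w, X, y):
    # Squared loss 0.5 * (x.w - y)^2 has gradient (x.w - y) * x per example.
    return (X @ w - y)[:, None] * X

def val_loss(w, x_val, y_val):
    return 0.5 * (x_val @ w - y_val) ** 2

def first_order_scores(w, X, y, x_val, y_val, eta):
    # Assumed utility: U(S) = loss_val(w - eta * sum_{i in S} g_i) - loss_val(w).
    # Its first-order expansion is -eta * g_val . sum_{i in S} g_i, which is
    # additive, so the Shapley value of example i is -eta * g_val . g_i.
    g = per_example_grads(w, X, y)
    g_val = (x_val @ w - y_val) * x_val
    return -eta * (g @ g_val)

def exact_shapley(w, X, y, x_val, y_val, eta):
    # Brute-force Shapley values of the same utility (exponential in batch size).
    n = len(y)
    g = per_example_grads(w, X, y)
    base = val_loss(w, x_val, y_val)

    def utility(subset):
        step = g[list(subset)].sum(axis=0) if subset else np.zeros_like(w)
        return val_loss(w - eta * step, x_val, y_val) - base

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for S in itertools.combinations(others, k):
                phi[i] += weight * (utility(S + (i,)) - utility(S))
    return phi

def max_error(eta, seed=0):
    # Largest gap between the closed-form and brute-force scores on toy data.
    rng = np.random.default_rng(seed)
    X, y = rng.normal(size=(4, 3)), rng.normal(size=4)
    x_val, y_val = rng.normal(size=3), rng.normal()
    w = np.zeros(3)
    approx = first_order_scores(w, X, y, x_val, y_val, eta)
    exact = exact_shapley(w, X, y, x_val, y_val, eta)
    return float(np.max(np.abs(approx - exact)))

err_large = max_error(1e-3)
err_small = max_error(1e-4)
print(err_large, err_small)
```

With this sign convention a negative score means the example lowers the validation loss. Because the loss here is quadratic, the gap between the two sets of scores comes only from the second-order term and shrinks as the learning rate decreases. The sketch materializes per-example gradients explicitly and does not attempt the ghost dot-product trick mentioned above.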
Abstract
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
