GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection
Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma
TL;DR
The paper tackles the scalability of gradient-based data attribution by exploiting the sparsity of per-sample gradients to reach sub-linear space and time complexity. It introduces GraSS, a two-stage compression scheme that first sparsifies each gradient and then applies a sparse Johnson-Lindenstrauss transform (SJLT) as the projection. It also introduces FactGraSS, a variant for linear layers that uses the factorized (Kronecker) structure of their gradients to avoid materializing the full gradient. Both methods deliver substantial speedups over prior baselines while preserving attribution fidelity: FactGraSS reaches up to 165% faster throughput on billion-scale models, and the methods are applied to GPT2-small and Llama-3.1-8B-Instruct. By cutting the memory and compute cost of gradient compression, the work makes data attribution more practical for large models.
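The two-stage idea (sparsify, then sparse projection) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sparsifier is assumed to be a fixed random coordinate mask, the SJLT is assumed to use one nonzero per input coordinate, and the names `make_sjlt`, `sjlt_project`, and `grass_compress` are hypothetical.

```python
import numpy as np

def make_sjlt(in_dim, out_dim, rng):
    """Draw a sparse JL transform with one nonzero per input coordinate
    (an assumption for this sketch): a target bucket and a random sign."""
    bucket = rng.integers(0, out_dim, size=in_dim)
    sign = rng.choice(np.array([-1.0, 1.0]), size=in_dim)
    return bucket, sign

def sjlt_project(x, bucket, sign, out_dim):
    """Apply the SJLT; only the nonzero entries of x are visited."""
    out = np.zeros(out_dim)
    nz = np.flatnonzero(x)
    # Scatter-add each signed nonzero entry into its bucket.
    np.add.at(out, bucket[nz], sign[nz] * x[nz])
    return out

def grass_compress(grad, mask_idx, bucket, sign, out_dim):
    """Stage 1: keep only the masked coordinates. Stage 2: SJLT."""
    kept = grad[mask_idx]
    return sjlt_project(kept, bucket, sign, out_dim)

# Hypothetical sizes: a 1000-dim gradient, 200 kept coordinates, 32 outputs.
rng = np.random.default_rng(0)
p, k_mask, k_out = 1000, 200, 32
mask_idx = rng.choice(p, size=k_mask, replace=False)
bucket, sign = make_sjlt(k_mask, k_out, rng)

grad = rng.normal(size=p)
compressed = grass_compress(grad, mask_idx, bucket, sign, k_out)
print(compressed.shape)
```

In this sketch the same `mask_idx`, `bucket`, and `sign` must be reused for every sample so that compressed gradients remain comparable. The projection cost scales with the number of nonzero entries kept rather than with the full gradient dimension.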
Abstract
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm, and its variant FactGraSS, designed specifically for linear layers, both of which explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
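To illustrate why a linear layer permits compression without forming the full gradient, the sketch below relies on a standard fact: the weight gradient of a linear layer is the product of the output gradients and the layer inputs, so selecting rows and columns of that gradient is equivalent to selecting coordinates of the two factors first. This is only an illustration of that identity under assumed shapes and a coordinate-mask sparsifier; the function name `factored_masked_grad` is hypothetical and the code is not taken from the GraSS repository.

```python
import numpy as np

def factored_masked_grad(x, dz, in_idx, out_idx):
    """Masked weight gradient of a linear layer, built from its factors.

    x:  (T, d_in)  layer inputs for one sample (T tokens)
    dz: (T, d_out) gradients w.r.t. the layer's pre-activation outputs
    Returns the (len(out_idx), len(in_idx)) sub-block of dz.T @ x
    without ever forming the full (d_out, d_in) matrix.
    """
    return dz[:, out_idx].T @ x[:, in_idx]

# Hypothetical sizes for illustration.
rng = np.random.default_rng(0)
T, d_in, d_out = 5, 12, 8
x = rng.normal(size=(T, d_in))
dz = rng.normal(size=(T, d_out))
in_idx = rng.choice(d_in, size=4, replace=False)
out_idx = rng.choice(d_out, size=3, replace=False)

small = factored_masked_grad(x, dz, in_idx, out_idx)

# Reference: materialize the full gradient, then select the same block.
full = dz.T @ x
reference = full[np.ix_(out_idx, in_idx)]
print(np.allclose(small, reference))
```

The factored path touches only the selected input and output coordinates, so its cost depends on the mask sizes rather than on `d_in * d_out`. A sparse projection could then be applied to the small block, as in the two-stage scheme described above.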
