DMin: Scalable Training Data Influence Estimation for Diffusion Models
Huawei Lin, Yingjie Lao, Weijie Zhao
TL;DR
Existing training data influence estimation methods do not scale to diffusion models with billions of parameters. DMin closes this gap with a gradient-compression framework: it caches per-sample gradients across diffusion timesteps, applies $L_2$ normalization, and compresses them via padding, permutation, random projection, and group summation. The compressed gradients support influence scoring either by exact inner products or by fast $k$-nearest-neighbor retrieval. Storage drops from hundreds of TBs to MBs or even KBs, and top-$k$ retrieval takes under one second. Experiments on both conditional and unconditional diffusion models show that DMin identifies influential training samples accurately and outperforms baselines while using orders of magnitude less time and memory. An open-source PyTorch implementation is provided, supporting transparency and data provenance for diffusion-model outputs.
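The compression steps named above (normalization, padding, permutation, random projection, group summation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `make_compressor` and `compress_gradient`, the use of a random ±1 sign vector as the random projection, the ordering of the steps, and all sizes are assumptions chosen for illustration, and the per-timestep caching is omitted.

```python
import numpy as np

def make_compressor(dim, group_size, seed=0):
    """Build the fixed permutation and sign vector shared by all samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Round the dimension up to the next multiple of group_size.
    padded_dim = -(-dim // group_size) * group_size
    perm = rng.permutation(padded_dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=padded_dim)
    return perm, signs

def compress_gradient(grad, group_size, perm, signs):
    """Compress a flat gradient to len(perm) // group_size entries (illustrative sketch)."""
    # L2-normalize so gradient magnitude alone does not dominate the score.
    norm = np.linalg.norm(grad)
    g = grad / norm if norm > 0 else grad.copy()
    # Zero-pad to the padded length.
    padded = np.zeros(perm.shape[0], dtype=g.dtype)
    padded[: g.shape[0]] = g
    # Shuffle with the fixed permutation, then apply random sign flips
    # (assumed here as a simple form of random projection).
    mixed = padded[perm] * signs
    # Sum each run of group_size consecutive entries into one value.
    return mixed.reshape(-1, group_size).sum(axis=1)

# Toy usage: two related gradients compressed from 1000 to 125 dimensions.
dim, group_size = 1000, 8
perm, signs = make_compressor(dim, group_size)
rng = np.random.default_rng(1)
g_train = rng.standard_normal(dim)
g_query = g_train + 0.1 * rng.standard_normal(dim)
c_train = compress_gradient(g_train, group_size, perm, signs)
c_query = compress_gradient(g_query, group_size, perm, signs)
score = float(c_train @ c_query)  # inner product of compressed gradients
```

The same permutation and sign vector must be reused for every training sample and every query; otherwise inner products between compressed vectors are not comparable.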
Abstract
Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models (DMs), yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. To the best of our knowledge, it is the first method capable of influence estimation for DMs with billions of parameters. Leveraging efficient gradient compression, DMin reduces storage requirements from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.
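As a rough illustration of the top-k retrieval step, the sketch below ranks training samples by the inner product between their compressed gradients and the compressed gradient of a generated image. It is a brute-force scan under assumed inputs, not the authors' code: the function name `top_k_influential` and the toy vectors are hypothetical, and at large scale an approximate nearest-neighbor index would typically replace the exhaustive scan.

```python
import numpy as np

def top_k_influential(train_vecs, query_vec, k):
    """Return indices and scores of the k training rows with the largest inner product."""
    scores = train_vecs @ query_vec
    k = min(k, scores.shape[0])
    # argpartition selects the k largest scores without a full sort.
    idx = np.argpartition(-scores, k - 1)[:k]
    # Order the selected k from most to least influential.
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]

# Toy usage: four compressed training gradients and one query gradient.
train_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.0])
top_idx, top_scores = top_k_influential(train_vecs, query_vec, k=2)
```

In this toy case the first and third rows score highest, because they align most closely with the query direction.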
