DMin: Scalable Training Data Influence Estimation for Diffusion Models

Huawei Lin, Yingjie Lao, Weijie Zhao

TL;DR

This work addresses the scalability gap in training data influence estimation for diffusion models with billions of parameters. It introduces DMin, a gradient-compression framework that caches per-sample gradients across diffusion timesteps, applies $L_2$ normalization, and compresses them via padding, permutation, random projection, and group summation. Influence scores can then be computed as exact inner products over the compressed gradients, or the top-$k$ most influential samples can be retrieved with a fast $k$-nearest-neighbor search. The approach reduces gradient storage from hundreds of TBs to MBs or even KBs and retrieves the top-$k$ samples in under one second. It is evaluated on both conditional and unconditional diffusion models, where it is reported to outperform baselines at identifying influential training samples while reducing time and memory costs by orders of magnitude. An open-source PyTorch implementation is also provided, which supports transparency and data-provenance analysis of diffusion-model outputs.
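As a rough illustration of the compression steps named in this summary, the following NumPy sketch applies them to a single flattened gradient. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `compress_gradient`, the `seed` parameter, the use of random $\pm 1$ signs as the random projection, and the contiguous grouping are all assumptions made for illustration, and `v` is taken to be the compressed dimension.

```python
import numpy as np


def compress_gradient(grad, v, seed=0):
    """Compress a flattened gradient to length v (illustrative sketch only)."""
    grad = np.asarray(grad, dtype=np.float64).ravel()

    # Step 1: L2 normalization (skipped for an all-zero gradient).
    norm = np.linalg.norm(grad)
    if norm > 0:
        grad = grad / norm

    # Step 2: zero-pad so the length is divisible by v.
    pad = (-grad.size) % v
    grad = np.pad(grad, (0, pad))

    # Steps 3 and 4: fixed random permutation, then random sign flips.
    # ASSUMPTION: the "random projection" is modeled here as +/-1 signs.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(grad.size)
    signs = rng.choice(np.array([-1.0, 1.0]), size=grad.size)
    mixed = grad[perm] * signs

    # Step 5: group summation. Split into v contiguous groups and sum each.
    return mixed.reshape(v, -1).sum(axis=1)


# Usage: compress two gradients with the SAME seed, then compare them.
data_rng = np.random.default_rng(1)
g_train = data_rng.normal(size=1000)
g_query = data_rng.normal(size=1000)
c_train = compress_gradient(g_train, v=64)
c_query = compress_gradient(g_query, v=64)
approx_score = float(c_train @ c_query)  # approximate similarity of the two gradients
```

Because the permutation and signs depend only on `seed`, training and query gradients compressed with the same seed stay comparable by inner product; using different seeds would make the scores meaningless.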

Abstract

Identifying the training data samples that most influence a generated image is a critical task in understanding diffusion models (DMs), yet existing influence estimation methods are constrained to small-scale or LoRA-tuned models due to computational limitations. To address this challenge, we propose DMin (Diffusion Model influence), a scalable framework for estimating the influence of each training data sample on a given generated image. To the best of our knowledge, it is the first method capable of influence estimation for DMs with billions of parameters. Leveraging efficient gradient compression, DMin reduces storage requirements from hundreds of TBs to mere MBs or even KBs, and retrieves the top-k most influential training samples in under 1 second, all while maintaining performance. Our empirical results demonstrate DMin is both effective in identifying influential training samples and efficient in terms of computational and storage requirements.

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Examples of influential training samples, with prompts displayed below each generated image (SD 3 Medium with LoRA, $v=2^{16}$).
  • Figure 2: Overview of the proposed DMin. (a) In gradient computation, given a training data sample (a pair of prompt $p^i$ and image $x^i$) and a timestep $t$, the data passes through the diffusion model in the same manner as during training. After the backward pass, the gradients $g^i_t$ at timestep $t$ can be obtained. (b) For the full model, gradients are collected from the UNet or transformer, whereas for models with adapters, such as LoRA, gradients are collected only from the adapter. (c) For a prompt $p^s$ and the corresponding generated image $x^s$, the gradients are obtained in the same way as in Gradient Computation. The influence $\mathcal{I}_\theta(X^s,X^i)$ is then estimated by aggregating gradients across timesteps from $t = 1$ to $T$. (d) In some cases, only the most influential data samples are needed; in such instances, KNN can be utilized to retrieve the top-$k$ most influential samples within seconds.
  • Figure 3: Examples of generated images alongside the most and least influential samples (from left to right) as estimated by DMin for unconditional DDPM models on the MNIST and CIFAR-10 datasets.
  • Figure 4: Examples of each dataset used in experiments.
  • Figure 5: Additional visualization for unconditional diffusion model on the MNIST dataset.
  • ...and 1 more figure
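The scoring and retrieval steps described in parts (c) and (d) of the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes the influence score is a plain sum of per-timestep inner products between compressed gradients (this section does not state any weighting), and it uses a brute-force top-$k$ selection in place of a KNN index. The names `influence_scores` and `top_k` are hypothetical.

```python
import numpy as np


def influence_scores(train_grads, query_grads):
    """Sum per-timestep inner products between training and query gradients.

    train_grads: array of shape (N, T, v), compressed gradients of N training samples
    query_grads: array of shape (T, v), compressed gradients of the generated image
    Returns an array of N scores. ASSUMPTION: unweighted sum over timesteps.
    """
    return np.einsum("ntv,tv->n", train_grads, query_grads)


def top_k(scores, k):
    """Indices of the k largest scores, highest first (brute force, not KNN)."""
    k = min(k, scores.size)
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    return idx[np.argsort(-scores[idx])]       # order them by descending score


# Usage with random stand-in data: 100 training samples, 5 timesteps, v = 16.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 5, 16))
query = rng.normal(size=(5, 16))
most_influential = top_k(influence_scores(train, query), k=3)
```

For large training sets, the exhaustive scan in `influence_scores` is the part a nearest-neighbor index would replace, trading exact scores for faster retrieval.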