Provable Separations between Memorization and Generalization in Diffusion Models

Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen

TL;DR

The paper addresses memorization in diffusion models by developing a dual separation framework: statistical, showing a nonzero gap between the ground-truth and empirical score functions via Fisher divergence, and architectural, proving the ground-truth score admits a compact neural representation while the empirical score requires network size that grows with the sample size. It quantifies the loss gap under mixture distributions and demonstrates how small diffusion times and data variance amplify memorization. Guided by theory, it proposes a pruning-based mitigation for diffusion transformers that reduces memorization while preserving sample quality, with empirical validation on Gaussian mixtures and CIFAR-10. The work provides a principled lens to understand memorization and offers practical, theory-informed strategies to improve generalization in diffusion-based generation.

Abstract

Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization -- reproducing training data rather than generating novel outputs. This not only limits their creative potential but also raises concerns about privacy and safety. While empirical studies have explored mitigation strategies, theoretical understanding of memorization remains limited. We address this gap through developing a dual-separation result via two complementary perspectives: statistical estimation and network approximation. From the estimation side, we show that the ground-truth score function does not minimize the empirical denoising loss, creating a separation that drives memorization. From the approximation side, we prove that implementing the empirical score function requires network size to scale with sample size, spelling a separation compared to the more compact network representation of the ground-truth score function. Guided by these insights, we develop a pruning-based method that reduces memorization while maintaining generation quality in diffusion transformers.

Paper Structure

This paper contains 51 sections, 32 theorems, 264 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Proposition 4.1

For any time $t \leq T$, it holds that [displayed equation missing from the source], where the divergence is defined as $\texttt{Fisher}(\widehat{P}_t, P_t) = \mathbb{E}_{X \sim \widehat{P}_t} [\|\nabla \log \widehat{p}_t(X) - \nabla \log p_t(X)\|_2^2]$.
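
The Fisher divergence defined above can be estimated numerically in a toy setting. The sketch below is illustrative only and is not the paper's code. It assumes a variance-preserving forward process $X_t = \alpha_t X_0 + \sigma_t \epsilon$ with $\alpha_t = e^{-t/2}$ and $\sigma_t^2 = 1 - e^{-t}$, and a standard Gaussian ground truth $P_0 = N(0, I)$; neither choice is stated in this summary. Under these assumptions $\widehat{P}_t$ is a Gaussian mixture centered at the scaled training points, so both scores have closed forms. The function names `empirical_score`, `true_score`, and `fisher_mc` are hypothetical.

```python
import numpy as np


def empirical_score(x, data, alpha, sigma):
    """Score of the smoothed empirical distribution (1/n) sum_i N(alpha*x_i, sigma^2 I)."""
    diff = alpha * data[None, :, :] - x[:, None, :]        # shape (m, n, d)
    logits = -np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2)  # shape (m, n)
    logits -= logits.max(axis=1, keepdims=True)              # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Posterior-weighted average of the per-component scores.
    return np.einsum("mn,mnd->md", weights, diff) / sigma ** 2


def true_score(x, alpha, sigma):
    """Score of P_t when P_0 = N(0, I) (assumed): P_t = N(0, (alpha^2 + sigma^2) I)."""
    return -x / (alpha ** 2 + sigma ** 2)


def fisher_mc(data, t, num_mc=4000, seed=0):
    """Monte Carlo estimate of E_{X ~ P_hat_t} ||score_hat(X) - score(X)||^2."""
    rng = np.random.default_rng(seed)
    alpha = np.exp(-t / 2)              # assumed VP schedule
    sigma = np.sqrt(1 - np.exp(-t))
    # Sample from P_hat_t: pick a training point, scale it, add Gaussian noise.
    idx = rng.integers(0, len(data), size=num_mc)
    x = alpha * data[idx] + sigma * rng.standard_normal((num_mc, data.shape[1]))
    gap = empirical_score(x, data, alpha, sigma) - true_score(x, alpha, sigma)
    return float(np.mean(np.sum(gap ** 2, axis=1)))


# Toy training set: n = 50 samples from the assumed ground truth in d = 2.
data = np.random.default_rng(0).standard_normal((50, 2))
gaps = {t: fisher_mc(data, t) for t in (0.01, 0.1, 1.0)}
for t, value in gaps.items():
    print(f"t = {t:<5} estimated Fisher divergence = {value:.3f}")
```

In this toy setting the estimate grows as $t$ shrinks, because the smoothed empirical distribution concentrates around individual training points. This matches the qualitative trend described for Figure 1.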

Figures (4)

  • Figure 1: Smaller $t$ leads to larger $\texttt{Loss-Gap}_t$. When sample size $n$ is not sufficiently large, the gap is non-negligible.
  • Figure 2: Learning 2D Gaussian mixture with varying network sizes. Increasing the network size leads to a clear progression: from failing to capture the underlying distribution, to partial generalization, and eventually to memorization. Memorized samples generated by the largest network are highlighted in red.
  • Figure 3: Comparison of experimental results on Gaussian mixture data. In (b), solid lines show memorization ratio, dashed lines show mean log-likelihood.
  • Figure 4: Left: Generated images from the same random noise, with the original model (top) and our pruned model (bottom). Right: Nearest neighbors of the generated images in the CIFAR-10 training set. At a comparable level of quality, the pruned model shows greater diversity, while the original model tends to replicate training samples.

Theorems & Definitions (51)

  • Definition 3.1: Hölder norm
  • Definition 3.2: Sub-Gaussian Hölder density
  • Proposition 4.1
  • Theorem 4.3: Lower bound on $\texttt{Loss-Gap}_t$
  • Theorem 5.1
  • Lemma 5.2
  • proof
  • Lemma A.1: Laurent-Massart bound for $\chi^2$ concentration (Laurent and Massart, 2000)
  • Lemma A.2: Norm Concentration of $\epsilon$
  • Corollary A.3: Sample Separation and Norm Control
  • ...and 41 more
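
For reference, the standard form of the Laurent-Massart bound named in Lemma A.1 is given below. This is the commonly cited statement; the paper's lemma is not reproduced in this summary and may use a variant. For $Z \sim \chi^2_k$ and any $x > 0$:

$$
\mathbb{P}\left(Z - k \geq 2\sqrt{kx} + 2x\right) \leq e^{-x},
\qquad
\mathbb{P}\left(k - Z \geq 2\sqrt{kx}\right) \leq e^{-x}.
$$

Since $\|\epsilon\|_2^2 \sim \chi^2_d$ for $\epsilon \sim N(0, I_d)$, bounds of this type control the norm of the Gaussian noise, which is the subject of Lemma A.2.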