Why Diffusion Models Don't Memorize: The Role of Implicit Dynamical Regularization in Training

Tony Bonnaire, Raphaël Urfin, Giulio Biroli, Marc Mézard

TL;DR

The paper tackles why diffusion and score-based generative models avoid memorizing training data while still achieving strong performance. It uncovers two well-separated training timescales, $\tau_{\mathrm{gen}}$ and $\tau_{\mathrm{mem}}$, with $\tau_{\mathrm{mem}}$ growing roughly linearly with the dataset size $n$, producing a growing generalization window as $n$ increases. Through extensive experiments on CelebA with U-Nets and a theoretically tractable Random Features Network, the authors show that the memorization phase is an intrinsic dynamical phenomenon rather than a simple consequence of repeated data exposure, and they connect the two timescales to distinct spectral bulks in the RFN analysis. The work highlights implicit dynamical regularization as a core mechanism that promotes generalization in overparameterized diffusion models and provides practical guidance (early stopping, capacity control) to avoid memorization in practice. Overall, it offers a unified framework linking training dynamics, spectral properties, and generalization behavior across score-based generative models.

Abstract

Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\tau_\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\tau_\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\tau_\mathrm{mem}$ increases linearly with the training set size $n$, while $\tau_\mathrm{gen}$ remains constant. This creates a window of training times, growing with $n$, in which models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allows models to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.
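The scaling described in the abstract can be illustrated with a small sketch. It assumes the idealized law $\tau_\mathrm{mem} = c\,n$ with a constant $\tau_\mathrm{gen}$; the function names, the rate `steps_per_sample` (playing the role of $c$), and all numerical values are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def generalization_window(n, tau_gen, steps_per_sample):
    """Return (tau_gen, tau_mem) under the assumed law tau_mem = steps_per_sample * n.

    Returns None when tau_mem <= tau_gen, i.e. when memorization would set in
    before good samples appear and no early-stopping window exists.
    """
    tau_mem = steps_per_sample * n
    if tau_mem <= tau_gen:
        return None
    return (tau_gen, tau_mem)


def window_widths(sizes, tau_gen, steps_per_sample):
    """Width tau_mem - tau_gen of the window for each dataset size in `sizes`."""
    widths = {}
    for n in sizes:
        window = generalization_window(n, tau_gen, steps_per_sample)
        widths[n] = 0.0 if window is None else window[1] - window[0]
    return widths


# Hypothetical constants: tau_gen = 50,000 steps, 500 steps per training sample.
WIDTHS = window_widths([256, 1024, 4096], tau_gen=50_000.0, steps_per_sample=500.0)
```

Under these assumptions the window width grows linearly in $n$ once $c\,n > \tau_\mathrm{gen}$, which is the sense in which stopping training anywhere in $[\tau_\mathrm{gen}, \tau_\mathrm{mem}]$ acts as a regularizer.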

Paper Structure

This paper contains 20 sections, 11 theorems, 150 equations, 15 figures.

Key Result

Theorem 3.1

Let $q(z)=\frac{1}{p}\mathop{\mathrm{Tr}}\nolimits(\bm{\mathrm{U}}-z\bm{I}_p)^{-1}$, $r(z)=\frac{1}{p}\mathop{\mathrm{Tr}}\nolimits(\boldsymbol{\Sigma}^{1/2}\bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z\bm{I}_p)^{-1}\bm{\mathrm{W}}\boldsymbol{\Sigma}^{1/2})$ and $s(z)=\frac{1}{p}\mathop{\mathrm{Tr}}\nolimits(\ldots)$. Then $q(z)$, $r(z)$ and $s(z)$ satisfy the following set of three equations: … The eigenvalue distribu…
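The quantity $q(z)$ in Theorem 3.1 is the normalized trace of the resolvent of $\bm{\mathrm{U}}$, whose imaginary part just above the real axis recovers the eigenvalue density through $\rho(\lambda)=\lim_{\varepsilon\to 0^+}\frac{1}{\pi}\operatorname{Im} q(\lambda+i\varepsilon)$. The sketch below evaluates this numerically with NumPy. Because the definition of $\bm{\mathrm{U}}$ is not reproduced here, the code uses a generic Wishart-type matrix as a stand-in; `U_standin`, the sizes `p` and `n`, and the smoothing width `eps` are assumptions made for illustration only.

```python
import numpy as np


def stieltjes_q(eigvals, z):
    """q(z) = (1/p) Tr (U - z I_p)^{-1}, computed from the eigenvalues of U."""
    return np.mean(1.0 / (eigvals - z))


def smoothed_density(eigvals, lam_grid, eps=0.05):
    """Approximate rho(lambda) by Im q(lambda + i*eps) / pi on a grid."""
    z = lam_grid[:, None] + 1j * eps
    q = np.mean(1.0 / (eigvals[None, :] - z), axis=1)
    return q.imag / np.pi


# Stand-in symmetric matrix (NOT the paper's U): a Wishart-type matrix X X^T / n.
rng = np.random.default_rng(0)
p, n = 200, 400
X = rng.standard_normal((p, n))
U_standin = X @ X.T / n

eigvals = np.linalg.eigvalsh(U_standin)   # real eigenvalues of a symmetric matrix
lam_grid = np.linspace(-1.0, 5.0, 601)    # grid covering the stand-in spectrum
rho = smoothed_density(eigvals, lam_grid)
```

With finite `eps` each eigenvalue contributes a Lorentzian of width `eps`, so `rho` is a smoothed histogram of the spectrum; shrinking `eps` sharpens it at the cost of more noise at finite `p`.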

Figures (15)

  • Figure 1: Qualitative summary of our contributions. (Left) Illustration of the training dynamics of a diffusion model. Depending on the training time $\tau$, we identify three regimes measured by the inverse quality of the generated samples (blue curve) and their memorization fraction (red curve). The generalization regime extends over a large window of training times which increases with the training set size $n$. On top, we show a one-dimensional example of the learned score function during training (orange). The gray line gives the exact empirical score, at a given noise level, while the black dashed line corresponds to the true (population) score. (Right) Phase diagram in the $(n,p)$ plane illustrating three regimes of diffusion models: Memorization when $n$ is sufficiently small at fixed $p$, Architectural Regularization for $n>n^{\star}(p)$ (which is model and dataset dependent, as discussed in [george_2025, Kamb2024]), and Dynamical Regularization, corresponding to a large intermediate generalization regime obtained when the training dynamics is stopped early, i.e. $\tau \in \left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right]$.
  • Figure 2: Memorization transition as a function of the training set size $n$ for U-Net score models on CelebA. (Left) FID (solid lines, left axis) and memorization fraction $f_\mathrm{mem}$ (dashed lines, right axis) against training time $\tau$ for various $n$. Inset: normalized memorization fraction $f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max})$ with the rescaled time $\tau/n$. (Middle) Training (solid lines) and test (dashed lines) loss with $\tau$ for several $n$ at fixed $t=0.01$. Inset: both losses plotted against $\tau/n$. Error bars on the losses are imperceptible. (Right) Generated samples from the model trained with $n=1024$ for $\tau=100$K or $\tau=1.62$M steps, along with their nearest neighbors in the training set.
  • Figure 3: Effect of the number of parameters in the U-Net architecture on the timescales of the training dynamics. (Left) FID (panels A, B) and normalized memorization fraction $f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max})$ (panels C, D) for various $n$ and $W$ during training. In panels B and D, time is rescaled such that all curves collapse. (Right) $(n,p)$ phase diagram of generalization vs memorization for U-Nets trained on CelebA. Curves show, for $\tau \in \{\tau_\mathrm{gen}, 3\tau_\mathrm{gen}, 8\tau_\mathrm{gen}\}$, the minimal dataset size $n(p)$ satisfying $f_\mathrm{mem}(\tau)=0$. The shaded background indicates the memorization--generalization boundary for $\tau=\tau_\mathrm{gen}$.
  • Figure 4: (Left) Illustration of an RFNN. (Middle/Right) Spectrum of $\bm{\mathrm{U}}$. Density $\rho(\lambda)$ from Theorem 3.1 in the overparameterized Regime I described in Theorem 3.2, with $\psi_p = 64$, $\psi_n = 8$, $t = 0.01$, and $\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1)$. The bulk of the spectrum (orange) is between $\lambda\approx10$ and $\lambda\approx45$. The histogram shows the eigenvalues from a single realization of $\bm{\mathrm{U}}$ at $d = 100$. Inset: zoom near $\lambda = 0$ (in blue) showing the first bulk $\rho_1$ and the delta peak at $\lambda = s_t^2$. (Right) Same as (Middle), but with $\rho_{\boldsymbol{\Sigma}}(\lambda) = \frac{1}{2}\delta(\lambda - 0.5) + \frac{1}{2}\delta(\lambda - 1.5)$. The first bulk in blue remains unchanged, as it depends only on $\sigma_{\bm{\mathrm{x}}}^2 = \mathop{\mathrm{Tr}}\nolimits(\boldsymbol{\Sigma})/d = 1$ in both cases, while the second bulk varies with $\boldsymbol{\Sigma}$.
  • Figure 5: Evolution of the training and test losses for the RFNN. (A) Distance to the true score $\mathcal{E}_\mathrm{score}$ against training time $\tau$ for $\psi_n=4,8,16,32$, $\psi_p=64$, $t=0.1$ and $d=100$. In the inset, the training time is rescaled by $\tau_\mathrm{mem}=\psi_p/\Delta_t\lambda_\mathrm{min}$. (B) Training (solid) and test (dashed) losses for various $\psi_n$. The inset shows both losses rescaled by $\tau_\mathrm{mem}$. (C) Heatmaps of $\mathcal{L}_\mathrm{gen}$ for $\tau=10^{3}$ (top) and $\tau=10^4$ (bottom) as a function of $\psi_n$ and $\psi_p$. All the curves use PyTorch [pytorch_2019] gradient descent. More numerical details can be found in SM Sect. \ref{appendix:Num_exp_RF}.
  • ...and 10 more figures
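
The memorization fraction $f_\mathrm{mem}$ plotted in Figures 2 and 3 is not defined in this summary. One common way to estimate such a quantity is a nearest-neighbor ratio test, sketched below: a generated sample is counted as memorized when it is much closer to its nearest training point than to its second nearest. This criterion, the function name `memorization_fraction`, and the threshold `ratio_threshold=1/3` are assumptions made for illustration and may differ from the definition the authors use.

```python
import numpy as np


def memorization_fraction(generated, train, ratio_threshold=1.0 / 3.0):
    """Fraction of generated samples flagged as memorized.

    Assumed criterion (not confirmed by the source): a sample x is memorized if
    dist(x, nearest train point) < ratio_threshold * dist(x, second nearest).
    Requires at least two training points.
    """
    gen = np.asarray(generated, dtype=np.float64).reshape(len(generated), -1)
    trn = np.asarray(train, dtype=np.float64).reshape(len(train), -1)

    # Pairwise Euclidean distances via |g|^2 + |t|^2 - 2 g.t, clipped at zero.
    sq = (gen ** 2).sum(axis=1)[:, None] + (trn ** 2).sum(axis=1)[None, :]
    dist = np.sqrt(np.maximum(sq - 2.0 * gen @ trn.T, 0.0))

    # Two smallest distances per generated sample, in ascending order.
    two_nearest = np.partition(dist, 1, axis=1)[:, :2]
    nearest, second = two_nearest[:, 0], two_nearest[:, 1]

    memorized = nearest < ratio_threshold * second
    return float(memorized.mean())


# Toy data: the first generated point sits next to a training point,
# the second lies halfway between two training points.
train_toy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
generated_toy = np.array([[0.1, 0.0], [5.0, 0.0]])
f_mem_toy = memorization_fraction(generated_toy, train_toy)
```

In this toy case only the first generated point passes the ratio test. For image data, the same function applies after flattening each image to a vector, which the `reshape` call does.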

Theorems & Definitions (20)

  • Theorem 3.1
  • Theorem 3.2: Informal
  • Proposition C.1
  • proof
  • Lemma C.1: Gaussian Equivalence Principle for $\bm{\mathrm{U}}$
  • proof
  • Lemma C.2: GEP for $\tilde{U}$
  • proof
  • Lemma C.3: Scaling of the bulk of $\tilde{\bm{\mathrm{U}}}$
  • proof
  • ...and 10 more