Diffusion Models Generalize but Not in the Way You Might Think

Tim Kaiser, Markus Kollmann

Abstract

Standard evaluation metrics suggest that Denoising Diffusion Models based on U-Net or Transformer architectures generalize well in practice. However, because an optimal Diffusion Model can be shown to fully memorize the training data, generalization is determined by the model error. Here, we show that although sufficiently large denoiser models memorize more of the training set as training time increases, the resulting denoising trajectories do not follow this trend. Our experiments indicate that this is because overfitting occurs at intermediate noise levels, whereas the distribution of noisy training data at these noise levels has little overlap with the denoising trajectories followed during inference. To gain more insight, we use a 2D toy diffusion model to show that overfitting at intermediate noise levels is largely determined by model error and the density of the data support. While the optimal denoising flow field localizes sharply around training samples, sufficient model error or dense support on the data manifold suppresses exact recall, yielding a smooth, generalizing flow field. To further support our results, we investigate how several factors, such as training time, model size, dataset size, condition granularity, and diffusion guidance, influence generalization behavior.

Paper Structure (20 sections, 7 equations, 31 figures, 2 tables)

Figures (31)

  • Figure 1: Generalization gap and overfitting for $E_{\cal D}(\sigma)$ at $\sigma \approx 1.67$. We define overfitting as a decrease in validation performance while training performance improves.
  • Figure 2: The relative generalization gap $(E_{val} - E_{train})/E_{train}$ increases with images seen during training (colorbar in millions) and with model size in relation to dataset size. The black line indicates the beginning of overfitting (validation error starts increasing, see Figure 1) for each noise level $\sigma$.
  • Figure 3: Flow field lines and predictions of the optimal target predictor $\bm y^*(\bm x, \sigma)$ at different noise levels $\sigma = 28, 2.8, 0.63$. Color indicates the magnitude of the prediction error $\bm y^*(\bm x, \sigma) - \bm x$. Field geometry is (a) global, predictions tend towards the superposition of all training points, (b) moderately localized around training points, predictions approximate the data manifold, and (c) highly localized around each training point, predictions replicate the training points.
  • Figure 4: Relative generalization gap $(E_{val} - E_{train})/E_{train}$ in our 2D toy model for different settings of the model error parameter $\delta$ (colorbar).
  • Figure 5: Filled contours showing the relative generalization gap $(E- E_{train})/E_{train}$ at intermediate noise ($\sigma \approx 1.1$) between the training points and each point in space, respectively. The black lines enclose the area where this gap is $0.5$ or less. As the model error $\delta$ decreases, this region shrinks until it no longer contains the validation points, indicating that the generalization gap to the validation set is larger than 0.5.
  • ...and 26 more figures
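
The captions for Figures 2 to 5 rely on two quantities: an optimal target predictor $\bm y^*(\bm x, \sigma)$ and the relative generalization gap $(E_{val} - E_{train})/E_{train}$. The following is a minimal sketch of both, not the authors' implementation. It assumes a noise model of the form $\bm x = \bm x_0 + \sigma \bm\epsilon$ with $\bm\epsilon \sim \mathcal{N}(0, I)$, under which the optimal predictor for a finite training set is the posterior mean, a softmax-weighted average of the training points. The names `optimal_denoiser` and `relative_gap`, and the eight-point circle dataset, are hypothetical and chosen only for illustration.

```python
import numpy as np

def optimal_denoiser(x, train, sigma):
    """Posterior mean E[x0 | x] over an empirical training set.

    Assumes x = x0 + sigma * eps with eps ~ N(0, I); this noise model is
    an assumption of the sketch, not something stated in the captions.
    """
    x = np.atleast_2d(x)                                    # (m, d)
    # Squared distances from every query to every training point: (m, n)
    sq_dist = ((x[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)             # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ train                                  # (m, d)

def relative_gap(e_val, e_train):
    """Relative generalization gap (E_val - E_train) / E_train."""
    return (e_val - e_train) / e_train

# Hypothetical 2D toy data: eight equally spaced points on the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
train = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# A query slightly offset from the first training point.
query = train[0] + 0.05

# Noise levels taken from the Figure 3 caption.
for sigma in (28.0, 2.8, 0.63):
    print(sigma, optimal_denoiser(query, train, sigma)[0])
```

At large $\sigma$ the weights are nearly uniform, so the prediction approaches the mean of all training points. As $\sigma$ shrinks, the weight concentrates on the nearest training point and the prediction replicates it, which matches the global-to-localized progression described for Figure 3.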