Generalization in diffusion models arises from geometry-adaptive harmonic representations

Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, Stéphane Mallat

TL;DR

This work shows that diffusion-based denoisers can move beyond memorization: two networks trained on sufficiently large, non-overlapping subsets of a dataset converge to nearly the same density model. The learned denoisers perform shrinkage in geometry-adaptive harmonic bases (GAHBs), where the eigenvectors of the denoiser's Jacobian form an input-dependent basis aligned with image geometry. Through analyses on C^α images, low-dimensional manifolds, and shuffled data, the authors argue that the inductive biases of DNNs steer denoising towards GAHBs, which are near-optimal for some image classes and suboptimal for others, and that this alignment explains both high sample quality and rapid generalization. The findings connect denoising performance, score estimation, and density modeling, offering a framework for evaluating and understanding generalization in diffusion models and its practical impact on image synthesis.

Abstract

Deep neural networks (DNNs) trained for image denoising are able to generate high-quality samples with score-based reverse diffusion algorithms. These impressive capabilities seem to imply an escape from the curse of dimensionality, but recent reports of memorization of the training set raise the question of whether these networks are learning the "true" continuous density of the data. Here, we show that two DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, when the number of training images is large enough. In this regime of strong generalization, diffusion-generated images are distinct from the training set, and are of high visual quality, suggesting that the inductive biases of the DNNs are well-aligned with the data density. We analyze the learned denoising functions and show that the inductive biases give rise to a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous regions. We demonstrate that trained denoisers are inductively biased towards these geometry-adaptive harmonic bases since they arise not only when the network is trained on photographic images, but also when it is trained on image classes supported on low-dimensional manifolds for which the harmonic basis is suboptimal. Finally, we show that when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic, the denoising performance of the networks is near-optimal.
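
The shrinkage interpretation described above can be written compactly. What follows is a sketch in the notation of the figure captions ($\lambda_k(y)$, $e_k(y)$), not a quotation from the abstract. It assumes that the denoiser $f$ has a symmetric Jacobian and is locally linear with no additive bias, so that $f(y) = \nabla f(y)\, y$; both assumptions are stated here for illustration. Under them, the eigendecomposition of the Jacobian at a noisy image $y$ gives the adaptive basis and the shrinkage factors:

$$
\nabla f(y) = \sum_k \lambda_k(y)\, e_k(y)\, e_k(y)^\top,
\qquad
f(y) = \nabla f(y)\, y = \sum_k \lambda_k(y)\, \langle y, e_k(y) \rangle\, e_k(y).
$$

The link between denoising and the score is the Tweedie/Miyasawa identity, which holds for the minimum-MSE denoiser of $y = x + z$ with $z \sim \mathcal{N}(0, \sigma^2 \mathrm{Id})$, where $p_\sigma$ denotes the density of the noisy images:

$$
f(y) = y + \sigma^2\, \nabla \log p_\sigma(y).
$$

Read this way, two networks that learn nearly the same denoising function have learned nearly the same score $\nabla \log p_\sigma$, which is why agreement between denoisers is taken as agreement between density models.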

Paper Structure (39 sections, 41 equations, 21 figures, 1 table, 2 algorithms)

Figures (21)

  • Figure 1: Transition from memorization to generalization, for a UNet denoiser trained on face images. Each curve shows the denoising error (output PSNR, ten times log10 ratio of squared dynamic range to MSE) as a function of noise level (input PSNR), for a training set of size $N$. As $N$ increases, performance on the training set generally worsens (left), while performance on the test set improves (right). For $N=1$ and $N=10$, the train PSNR improves with unit slope, while test PSNR is poor and independent of noise level, a sign of memorization. The increase in test performance at small noise levels for $N=1000$ is indicative of the transition phase from memorization to generalization. At $N = 10^5$, test and train PSNR are essentially identical, and the model is no longer overfitting the training data.
  • Figure 2: Convergence of model variance. Diffusion models are trained on non-overlapping subsets $S_1$ and $S_2$ of a face dataset (filtered for duplicates). The subset size $N$ varies from $1$ to $10^5$. We then generate a sample from each model with a reverse diffusion algorithm, initialized from the same noise image. Top. For training sets of size $N=1$ to $N=100$, the networks memorize, producing samples nearly identical to examples from the training set. For $N = 1000$, generated samples are similar to a training example, but show distortions in some regions. This transitional regime corresponds to a qualitative change in the shape of the PSNR curve (Figure 1). For $N = 10^5$, the two networks generate nearly identical samples, which no longer resemble images in their corresponding training sets. Bottom. The distribution of cosine similarity (normalized inner product) between pairs of images generated by the two networks (blue) shifts from left to right with increasing $N$, showing vanishing model variance. Conversely, the distribution of cosine similarity between generated samples and the most similar image in their corresponding training set (orange) shifts from right to left. For comparison, the appendix on additional results shows the distribution of cosine similarities of closest pairs between the two training subsets, and additional results on the LSUN bedroom dataset (Yu et al., 2015) and for the BF-CNN architecture (Mohan, Kadkhodaie et al., 2019).
  • Figure 3: Analysis of a denoiser trained on $10^5$ face images, evaluated on a noisy test image. Top left. Clean, noisy ($\sigma = 0.15$) and denoised images. Bottom left. Decay of shrinkage values $\lambda_k(y)$ (red), and corresponding coefficients $\langle x, e_k(y) \rangle$ (blue), evaluated for the noisy image $y$. The rapid decay of the coefficients indicates that the image content is highly concentrated within the preserved subspace. Right. The adaptive basis vectors $e_k(y)$ contain oscillating patterns, adapted to lie along the contours and within smooth regions of the image, whose frequency increases as $\lambda_k(y)$ decreases.
  • Figure 4: UNet denoisers trained on $10^5$ ${\bf C}^{\alpha}$ images achieve near-optimal performance. Left. PSNR curves for various regularity levels $\alpha$. The empirical slopes closely match the theoretical optimal slopes (parenthesized values, dashed lines). Right. A ${\bf C}^{\alpha}$ image ($\alpha=4$) of size $80\times80$ and its top eigenvectors, which consist of harmonics on the two regions and harmonics along the boundary. The frequency of the harmonics increases with $k$. More examples are given in the appendix on ${\bf C}^{\alpha}$ results.
  • Figure 5: UNet denoiser trained on a dataset of translating and dilating disks, with variable foreground/background intensity. Top center. Clean, noisy ($\sigma = 0.04$), and denoised images. Bottom center. The decay of shrinkage factors $\lambda_k(y)$ and coefficients $\langle x, e_k(y) \rangle$ indicates that the network achieves and preserves a sparse representation of the true image. Top right. Denoising performance is sub-optimal, with PSNR slope below the optimal value of $1.0$ for small noise. Top left. An optimal basis (in the small-noise limit) spanning the 5-dimensional tangent space of the image manifold. Bottom left. Top eigenvectors of the adaptive basis. The first five basis vectors closely match the basis of the tangent space of the manifold evaluated at the clean image. In contrast, the next five are GAHBs that lie along contours and within background regions of the clean image.
  • ...and 16 more figures
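
The two quantities used throughout the captions above, PSNR (Figure 1) and cosine similarity (Figure 2), can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, images are assumed to be flattened lists of pixel values with a dynamic range of 1, and computing the input PSNR directly from the noise variance $\sigma^2$ is an assumption about how the noise-level axis is defined.

```python
import math


def mse(a, b):
    """Mean squared error between two flattened images of equal length."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def psnr_db(mse_value, dynamic_range=1.0):
    """PSNR in dB: ten times log10 of squared dynamic range over MSE."""
    if mse_value == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(dynamic_range ** 2 / mse_value)


def input_psnr_db(sigma, dynamic_range=1.0):
    """Noise level as a PSNR, assuming the noisy image has MSE sigma**2."""
    return psnr_db(sigma ** 2, dynamic_range)


def cosine_similarity(a, b):
    """Normalized inner product between two flattened images."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero image")
    return dot / (norm_a * norm_b)


def closest_training_similarity(sample, training_set):
    """Cosine similarity between a sample and its most similar training image."""
    return max(cosine_similarity(sample, image) for image in training_set)
```

With a unit dynamic range, an MSE of 0.01 corresponds to `psnr_db(0.01)`, which is 20 dB. In the terms of Figure 2, `cosine_similarity` applied to the two networks' samples plays the role of the blue histogram, and `closest_training_similarity` plays the role of the orange one.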