Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves

Anand Jerry George, Rodrigo Veiga, Nicolas Macris

TL;DR

This work analyzes generalization and memorization in diffusion models by modeling the score with random features neural networks (RFNNs) and minimizing the denoising score matching (DSM) objective. In a high-dimensional regime where $d,n,p \to \infty$ with fixed ratios $\psi_n=n/d$ and $\psi_p=p/d$, the authors derive exact asymptotic learning curves for two extremes of the number $m$ of noise samples per data sample, $m=\infty$ and $m=1$, using the Gaussian Equivalence Principle and linear pencils. A key finding is a phase diagram with a crossover at $p=n$: generalization dominates when $p<n$, memorization emerges when $p>n$, and larger $m$ amplifies memorization in the overparameterized regime. The theoretical predictions are complemented by numerical simulations on Gaussian data and on real datasets (Fashion-MNIST, MNIST) with RFNN and U-Net scores, which indicate that the phase behavior is relevant beyond the idealized setting.
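
For concreteness, the DSM objective with $m$ noise samples per data point can be written as below. This is a sketch under assumptions rather than the paper's exact statement: the forward marginal $x_t = a_t x_0 + \sqrt{h_t}\,z$ with $z\sim\mathcal{N}(0,I_d)$, the random-features form of $s_A$ with a frozen first layer $W$ and a trained second layer $A$, the ridge penalty, and all normalizations are assumed, chosen to be compatible with the symbols $a_t$, $h_t$, $\varrho$, and $\lambda$ that appear in the Key Result and the figure captions.

$$
\hat{\mathcal{L}}_t(A) = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left\lVert \sqrt{h_t}\, s_A\!\left(a_t x_i + \sqrt{h_t}\, z_{ij}\right) + z_{ij}\right\rVert^2 + \lambda\left\lVert A\right\rVert_F^2,
\qquad
s_A(x) = \frac{A}{\sqrt{p}}\,\varrho\!\left(\frac{W x}{\sqrt{d}}\right).
$$

Under this reading, $m=1$ pairs each training sample with a single noise draw, while $m=\infty$ replaces the inner average by an expectation over $z$.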

Abstract

We theoretically investigate the phenomena of generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed light on their mechanisms by deriving asymptotically precise expressions for the test and train errors of DSM in a simple theoretical setting. The score function is parameterized by random features neural networks, and the target distribution is a $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, the number of data samples $n$, and the number of features $p$ tend to infinity while the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ are kept fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n$, $\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.
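
The setting described in the abstract can be explored numerically with a small finite-size experiment. The following is a minimal sketch, not the authors' code: the function name `dsm_random_features`, the Ornstein-Uhlenbeck choices `a_t = exp(-t)` and `h_t = 1 - exp(-2t)`, the ReLU feature map with its `1/sqrt(d)` and `1/sqrt(p)` scalings, the ridge penalty `lam`, and the definition of the test error as the per-dimension squared distance to the true Gaussian score are all assumptions made for illustration.

```python
import numpy as np


def dsm_random_features(d=20, n=100, p=50, m=1, t=0.5, lam=1e-3, seed=0):
    """Fit a random-features score by ridge-regularised DSM.

    Returns (train_error, test_error), both divided by the dimension d.
    All modelling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    a_t = np.exp(-t)               # assumed signal scale of the forward process
    h_t = 1.0 - np.exp(-2.0 * t)   # assumed noise variance of the forward process
    W = rng.standard_normal((p, d))  # frozen random first layer

    def features(x):
        # ReLU random features, shape (N, d) -> (N, p)
        return np.maximum(x @ W.T / np.sqrt(d), 0.0) / np.sqrt(p)

    # n training samples from N(0, I_d), each paired with m noise draws
    x0 = np.repeat(rng.standard_normal((n, d)), m, axis=0)
    z = rng.standard_normal((n * m, d))
    phi = features(a_t * x0 + np.sqrt(h_t) * z)
    N = n * m

    # Closed-form minimiser of
    #   (1/N) * ||sqrt(h_t) * phi @ B + z||_F^2 + lam * ||B||_F^2
    gram = h_t * phi.T @ phi / N + lam * np.eye(p)
    B = np.linalg.solve(gram, -np.sqrt(h_t) * phi.T @ z / N)

    # Train error: DSM residual on the training pairs (ridge term excluded)
    train = np.mean(np.sum((np.sqrt(h_t) * phi @ B + z) ** 2, axis=1)) / d

    # Test error: distance to the true score -x / (a_t^2 + h_t) of the
    # noised Gaussian, estimated on fresh samples
    var_t = a_t**2 + h_t
    x_test = np.sqrt(var_t) * rng.standard_normal((2000, d))
    resid = features(x_test) @ B + x_test / var_t
    test = np.mean(np.sum(resid**2, axis=1)) / d
    return float(train), float(test)
```

Calling `dsm_random_features` over a grid of `p` values at fixed `d` and `n`, for a few values of `m`, gives finite-size train and test curves that can be compared qualitatively with the regimes described above. The asymptotic formulas themselves are not reproduced by this sketch.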

Paper Structure

This paper contains 33 sections, 7 theorems, 113 equations, 10 figures.

Key Result

Theorem 3.2

Suppose $P_0\equiv\mathcal{N}\left(0,I_d\right)$ and $\varrho$ satisfies the paper's assumption on the activation function. Let $s^2 = \left\lVert\varrho\right\rVert^2-c(a_t^2)-h_t\mu_1^2$, $v_0^2=c(a_t^2)-a_t^2\mu_1^2$, and $v^2 = \left\lVert\varrho\right\rVert^2-\mu_1^2$. Let $\psi_n = \frac{n}{d}$ and $\psi_p = \frac{p}{d}$. Define the function $\mathcal{K}(q,z) = -\frac{\zeta_3(q,z)}{a_t\mu_1}$. Let $\varepsilon^\infty_{\cdots}$ …

Figures (10)

  • Figure 1: Phase diagram showing regimes of generalization and memorization. The gradient in color with $m$ indicates the change in strength of the phenomenon.
  • Figure 2: Learning curves for $m=\infty$, with $\psi_n=20.0,\lambda=0.001$. The activation function is ReLU.
  • Figure 3: Learning curves for $m=\infty$, with $t=0.01$, $\lambda=10^{-3}$, and $\varrho\equiv\mathrm{ReLU}$. Solid (dashed) lines: test (train) error. Dotted vertical lines indicate $\psi_p=\psi_n$.
  • Figure 4: Results of experiments on memorization.
  • Figure 5: Learning curves for $m=1$, with $\lambda = 0.001$ and the ReLU activation function.
  • ...and 5 more figures

Theorems & Definitions (15)

  • Theorem 3.2
  • Remark 3.3
  • Remark 3.4
  • Lemma 3.5
  • Theorem 3.6
  • Lemma B.1
  • proof
  • Theorem B.2
  • Remark B.3
  • proof
  • ...and 5 more