Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization

Yu-Han Wu, Pierre Marion, Gérard Biau, Claire Boyer

TL;DR

Diffusion-based denoising score matching can lead to memorization if the empirical optimal score is learned exactly, but practical training does not reach that extreme. The authors show that the empirical optimal score $s^\star$ is highly irregular in the small-noise limit and that SGD with a large learning rate induces an implicit regularization that prevents convergence to a near-perfect empirical minimizer. Through a one-dimensional analysis of two-layer ReLU networks, they derive bounds linking the learning rate, noise level, and score regularity, and they validate the theory with experiments in one and higher dimensions. The results reveal a principled mechanism by which large learning rates mitigate memorization, with implications for training diffusion models and privacy considerations in generative systems.

Abstract

Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score--the exact solution to the denoising score matching--leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.
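
To make the irregularity claim concrete, here is a minimal NumPy sketch of the empirical optimal score in one dimension. It assumes (this is not stated in the text above) that noisy samples are formed as $\mu x_i + \sigma \varepsilon$ with $\varepsilon$ standard Gaussian, so that the noisy empirical distribution is the mixture $\frac{1}{n}\sum_i \mathcal{N}(\mu x_i, \sigma^2)$ and its score is a softmax-weighted average of $(\mu x_i - x)/\sigma^2$. The two-point training set, the grid, the helper name `max_abs_slope`, and the second $(\mu, \sigma)$ pair are hypothetical choices made for illustration.

```python
import numpy as np

def empirical_optimal_score(x, data, mu, sigma):
    """Score of the mixture (1/n) * sum_i N(mu * x_i, sigma^2), evaluated at x.

    Assumes the noising model x_t = mu * x_0 + sigma * eps (an assumption).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    centers = mu * np.asarray(data, dtype=float)
    # Unnormalised log-weights of each mixture component, shape (len(x), n).
    logits = -((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma**2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted mean of the centres minus x, scaled by 1 / sigma^2.
    return (weights @ centers - x) / sigma**2

def max_abs_slope(data, mu, sigma, grid=np.linspace(-2.0, 2.0, 4001)):
    """Largest finite-difference slope on a grid: a crude irregularity proxy."""
    s = empirical_optimal_score(grid, data, mu, sigma)
    return np.max(np.abs(np.diff(s) / np.diff(grid)))

data = np.array([-1.0, 1.0])  # hypothetical two-point training set
for mu, sigma in [(0.81, 0.57), (0.98, 0.2)]:  # second pair is illustrative only
    print(f"mu={mu}, sigma={sigma}: max |slope| = {max_abs_slope(data, mu, sigma):.1f}")
```

Under these assumptions the steepest part of the score sits between the two training points, and it becomes much steeper as $\sigma$ shrinks. This is the kind of irregularity the abstract refers to in the small-noise regime.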

Paper Structure

This paper contains 45 sections, 14 theorems, 152 equations, 5 figures.

Key Result

Theorem 1

(informal) Consider the denoising score matching objective $\mathcal{R}_n$ over the class of two-layer neural networks for one-dimensional data. Then, for a sufficiently small level of noise $\sigma$ and a learning rate $\eta \gtrsim \sigma^2$, the stochastic gradient descent on $\mathcal{R}_n$ cann

Figures (5)

  • Figure 1: Graphs of the learned model $s_{\theta^\star}$ with different learning rates and of the empirical optimal score $s^\star$, for two pairs of $(\mu, \sigma)$. As the learning rate decreases, $s_{\theta^\star}$ approaches $s^\star$. When $\sigma$ is smaller (right plot), $s^\star$ is more irregular, and a smaller learning rate is needed for $s_{\theta^\star}$ to approach $s^\star$.
  • Figure 2: Excess risk of the learned model $s_{\theta^\star}$ trained with different learning rates, for two pairs of $(\mu, \sigma)$ and two dimensions of the data ($d=1$, left, and $d=10$, right). The $x$-axis is in logarithmic scale while the $y$-axis is in standard scale. Confidence intervals are computed with 30 simulations.
  • Figure 3: (left) Sample generated by $s^\star$ and $s_{\theta^\star}$ fitted with learning rate $0.05$. The training data are the blue points. (middle) Same with $s_{\theta^\star}$ fitted with learning rate $2$. (right) The green marked curve corresponds to the MMD between observations generated by $s^\star$ and observations generated by $s_{\theta^\star}$ (for different learning rates). The pink curve is the MMD between observations following the Gaussian distribution fitted on the training data and observations generated by $s_{\theta^\star}$.
  • Figure 4: (left) Sample generated by $s_{\theta^\star}$ in dimension $10$, projected on the first two axes. The training data are the blue points. (middle) Same in dimension $400$. (right) The green marked curve corresponds to the MMD between observations generated by $s^\star$ and observations generated by $s_{\theta^\star}$, depending on the dimension. The pink curve is the MMD between observations following the Gaussian distribution fitted on the training data and observations generated by $s_{\theta^\star}$. Both distances are normalized by the MMD between observations generated by $s^\star$ and by the Gaussian distribution.
  • Figure 5: Largest eigenvalue of the loss Hessian (or sharpness) at the end of training, as a function of the learning rate. (left) and (middle) for the experiment of Figures 1 and 2, for $d=1$ and $d=10$ respectively, with $(\mu, \sigma) = (0.81, 0.57)$. (right) for the experiment of Figure 3.
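
The link between sharpness and learning rate reported in Figure 5 can be illustrated with the textbook stability condition for plain gradient descent on a quadratic: with curvature $\lambda$ and step size $\eta$, each step multiplies the iterate by $1 - \eta\lambda$, so the iterates contract only when $\eta < 2/\lambda$. The sketch below is that classical toy case, not the paper's SGD analysis. The function name `gd_on_quadratic` and all numerical values are hypothetical.

```python
def gd_on_quadratic(sharpness, lr, steps=50, theta0=1.0):
    """Run gradient descent on f(theta) = 0.5 * sharpness * theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * sharpness * theta  # gradient of f is sharpness * theta
    return theta

sharpness = 10.0  # hypothetical curvature; stability threshold is 2 / 10 = 0.2
for lr in (0.19, 0.21):
    print(f"lr={lr}: |theta| after 50 steps = {abs(gd_on_quadratic(sharpness, lr)):.3g}")
```

In this toy setting, a minimum whose sharpness exceeds $2/\eta$ cannot be a stable resting point for the iterates, which gives some intuition for why larger learning rates are associated with flatter solutions.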

Theorems & Definitions (14)

  • Theorem 1
  • Proposition 2
  • Corollary 3
  • Theorem 4
  • Lemma 5
  • Proposition 6
  • Proposition 7
  • Theorem 8
  • Proposition 9
  • Proposition 10
  • ...and 4 more