Kernel-Smoothed Scores for Denoising Diffusion: A Bias-Variance Study

Franck Gabriel, François Ged, Maria Han Veiga, Emmanuel Schertzer

TL;DR

This work analyzes memorization in diffusion-based generative models and proposes kernel-smoothed empirical scores to regularize the reverse diffusion, which leads to the Log-Exponential Double-Kernel Density Estimator (LED-KDE). It proves a central limit theorem for the mollified score, develops a bias-variance decomposition, and derives KL-divergence bounds showing that mollification reduces sampling variance and mitigates memorization while preserving manifold structure. Smoothing the score acts like an increase in the size of the training dataset and allows the reverse diffusion to be run down to smaller diffusion times, which improves generalization. Numerical experiments on synthetic data and MNIST illustrate the reduction in memorization and the smoother generation obtained with kernel-smoothed scores.

Abstract

Diffusion models now set the benchmark in high-fidelity generative sampling, yet they can, in principle, be prone to memorization: the learned score overfits the finite dataset, so that the reverse-time SDE mostly samples training points. In this paper, we interpret the empirical score as a noisy version of the true score and show that its covariance matrix is asymptotically a re-weighted data PCA. In high dimension, the small-time limit makes the noise variance blow up while simultaneously reducing spatial correlation. To reduce this variance, we introduce a kernel-smoothed empirical score and analyze its bias-variance trade-off. We derive asymptotic bounds on the Kullback-Leibler divergence between the true distribution and the one generated by the modified reverse SDE. Regularizing the score has the same effect as increasing the size of the training dataset, and thus helps prevent memorization. A spectral decomposition of the forward diffusion suggests better variance control under some regularity conditions on the true data distribution. Reverse diffusion with the kernel-smoothed empirical score can be reformulated as a gradient descent drifted toward a Log-Exponential Double-Kernel Density Estimator (LED-KDE). This perspective highlights two regularization mechanisms taking place in denoising diffusions: an initial Gaussian kernel first diffuses mass isotropically in the ambient space, while a second kernel applied in score space concentrates and spreads that mass along the data manifold. Hence, even a straightforward regularization, without any learning, already mitigates memorization and enhances generalization. Numerically, we illustrate our results with several experiments on synthetic and MNIST datasets.
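To make the kernel-smoothed empirical score concrete, here is a minimal sketch, not the authors' implementation. It assumes the forward kernel $\mathcal{G}_t$ is the isotropic Gaussian $\mathcal{N}(0, t\,\mathrm{Id})$ and that the smoothing kernel is also Gaussian, $\mathcal{N}(0, h\,\mathrm{Id})$; the function names, the bandwidth parameter `h`, and the Monte Carlo approximation of the convolution are all illustrative choices rather than anything specified in the text.

```python
import numpy as np

def empirical_score(x, data, t):
    """Score of G_t * p^N at x, where p^N is the empirical measure of `data`.

    Assumption: G_t is the isotropic Gaussian N(0, t * Id).
    """
    diffs = data - x                                  # rows are x_i - x, shape (N, d)
    logits = -np.sum(diffs ** 2, axis=1) / (2.0 * t)  # log Gaussian weights up to a constant
    logits -= logits.max()                            # stabilise the softmax
    w = np.exp(logits)
    w /= w.sum()                                      # posterior weights over training points
    return (w[:, None] * diffs).sum(axis=0) / t

def smoothed_score(x, data, t, h, n_mc=256, rng=None):
    """Monte Carlo estimate of the score convolved with N(0, h * Id).

    Assumption: a Gaussian smoothing kernel; `h` is a hypothetical bandwidth.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.normal(scale=np.sqrt(h), size=(n_mc, x.shape[0]))
    # Average the empirical score over random perturbations of the query point.
    return np.mean([empirical_score(x + zi, data, t) for zi in z], axis=0)
```

For small `t` the softmax weights in `empirical_score` collapse onto the nearest training point, which is the memorization regime described above; averaging over perturbations of `x` mixes contributions from several training points. Because the gradient commutes with the convolution, the smoothed score is itself the gradient of the smoothed log-density, which is one way to read the "log-exponential" in LED-KDE.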

Paper Structure

This paper contains 39 sections, 4 theorems, 127 equations, 14 figures, 1 table.

Key Result

Proposition 1

Suppose that the linear manifold assumption holds and that, without loss of generality, $\mathcal{M}=\operatorname{span}\{e_1,\dots,e_k\}\subset \mathbb{R}^d$. Let $\mathcal{G}_{t}^{\mathcal{M}}$ be the Gaussian kernel $\mathcal{N}(0,t\,\mathrm{Id}_{k}\oplus 0_{d-k})$. The measure $\mathcal{G}_{\sigma^{2}}^{\mathcal{M}}$ [...] where on the RHS the first measure is interpreted as a measure on $\mathbb{R}^k$.

Figures (14)

  • Figure 1: Left: analytical score. Middle: analytical score convolved with a Gaussian kernel with standard deviation $\sigma=0.15$. Right: neural network approximation of score.
  • Figure 2: Left: True probability measure $p_*$ convolved with a Gaussian kernel with $\sigma=0.02$, $\mathcal{G}_{0.02}$. Middle: KDE with the Gaussian kernel $\mathcal{G}_{0.02}$. Right: LED-KDE at time $0.02$ with $K=\mathcal{G}_{0.04}$.
  • Figure 3: Left: Eigenvector with non-zero corresponding eigenvalue aligned with the data manifold. Right: Scaling of the eigenvalue $\lambda_1$ of empirical covariance matrix ($N=10000$). The slope encodes the intrinsic dimension of the manifold.
  • Figure 4: Left: KL-divergence between $\mathcal{G}_{t_N}\star p_*$ and the empirical measure generated by following the score (red) and the KL-divergence between $\mathcal{G}_{t_N}\star p_*$ and the empirical measure generated by following the mollified score, varying $h$ (blue). Right: Ratio $N_{\mathrm{eff}}/N$ at the lowest reported KL-divergence. In both figures, $p_*$ is multi-dimensional Gaussian ($d=4$) and $N=100$.
  • Figure 5: Left: KDE with kernel $C_{0.5}$. Right: LED-KDE $(C_{0.47}, C_{0.5}) \star p^N_0$.
  • ...and 9 more figures

Theorems & Definitions (9)

  • Definition 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Remark 5
  • Proof of the theorem on covariance asymptotics as $t \to 0$
  • Proof of the theorem on bias-variance bounds from the CLT
  • Proof of the theorem on KL regimes