Understanding Generalization in Diffusion Models via Probability Flow Distance

Huijie Zhang, Zijian Huang, Siyi Chen, Jinfan Zhou, Zekai Zhang, Peng Wang, Qing Qu

TL;DR

This paper introduces Probability Flow Distance (PFD), a theoretically grounded and computationally efficient metric for measuring distributional generalization in diffusion models, based on the backward PF-ODE noise-to-data mapping. Using a teacher–student protocol, it quantitatively separates memorization from generalization and reveals a scaling law in which the memorization-to-generalization transition aligns with the ratio $N / \sqrt{|\bm{\theta}|}$, along with early learning and double descent dynamics and a bias–variance decomposition of the generalization error. The findings illuminate how model capacity and data interact in diffusion models, and demonstrate that PFD can reliably assess generalization beyond generation quality metrics like FID. This framework lays a foundation for principled empirical and theoretical studies of generalization in diffusion models and suggests directions for extending the approach to other generative paradigms.

Abstract

Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance ($\texttt{PFD}$), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, $\texttt{PFD}$ quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using $\texttt{PFD}$ under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.
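As a rough illustration of comparing two distributions through their noise-to-data mappings, the sketch below works with 1-D Gaussians, where the probability flow ODE can be solved in closed form. Everything here is an assumption made for illustration rather than the paper's definition: the noising process $x_t = x_0 + t\,\epsilon$, the terminal noise level `T`, the root-mean-square distance between the mapped outputs, and the helper names `pf_ode_map` and `pfd_estimate`.

```python
import numpy as np


def pf_ode_map(x_T, mu, s, T):
    """Closed-form noise-to-data map for a 1-D Gaussian N(mu, s^2).

    Assumes the noising process x_t = x_0 + t * eps, so p_t = N(mu, s^2 + t^2)
    and the PF-ODE dx/dt = t * (x - mu) / (s^2 + t^2) keeps
    (x - mu) / sqrt(s^2 + t^2) constant along each trajectory.
    """
    return mu + (x_T - mu) * s / np.sqrt(s**2 + T**2)


def pfd_estimate(p, q, T=80.0, n_samples=10_000, seed=0):
    """Monte Carlo estimate of a PFD-style distance (illustrative only).

    p and q are (mean, std) pairs. Both maps are evaluated on the SAME
    initial noise draws, and the root-mean-square gap is returned.
    """
    rng = np.random.default_rng(seed)
    x_T = T * rng.standard_normal(n_samples)  # shared initial noise
    diff = pf_ode_map(x_T, *p, T) - pf_ode_map(x_T, *q, T)
    return float(np.sqrt(np.mean(diff**2)))


if __name__ == "__main__":
    print(pfd_estimate((0.0, 1.0), (0.0, 1.0)))  # identical distributions
    print(pfd_estimate((0.0, 1.0), (3.0, 1.0)))  # shifted mean
    print(pfd_estimate((0.0, 1.0), (0.0, 2.0)))  # different spread
```

Feeding both mappings the same noise samples is what makes the comparison pointwise: two identical distributions produce identical outputs and a distance of zero, while any mismatch in mean or spread shows up as a nonzero gap.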

Paper Structure

This paper contains 30 sections, 4 theorems, 39 equations, 14 figures, 2 tables.

Key Result

Theorem 1

For any two distributions $p$ and $q$, the $\textup{PFD}$ satisfies the following properties:

Figures (14)

  • Figure 1: Comparison of synthetic and real datasets. The figure shows $\texttt{FID}$ and $\mathcal{E}_\mathtt{mem}$ as functions of $\log_2 N$. The green and red lines represent results from the same diffusion model trained and evaluated under real and synthetic data separately.
  • Figure 2: Comparison of practical metrics on the memorization-to-generalization (MtoG) transition. The top figure plots multiple evaluation metrics as functions of $\log_2 N$. The bottom figure visualizes the generations for $N = 2^6, 2^{12}, 2^{16}$, sampled from $p_{\mathtt{data}}$ (top row), $p_{\mathtt{emp}}$ (middle row), and $p_{\bm \theta}$ (bottom row). Images in the same column share the same initial noise.
  • Figure 3: Scaling behavior in the MtoG transition. Left: $\mathcal{E}_\mathtt{mem}$ and $\mathcal{E}_\mathtt{gen}$ plotted against $\log_2(N)$ for a range of U-Net architectures (U-Net-1 to U-Net-10). Right: the same metrics plotted against $\log_2(N/\sqrt{|\bm \theta|})$, where $|\bm \theta|$ is the number of model parameters.
  • Figure 4: Training dynamics of diffusion models in different regimes. The top figure plots $\mathcal{E}_\mathtt{mem}, \mathcal{E}_\mathtt{gen}, \ell_{\texttt{train}}, \ell_{\texttt{test}}$ over training epochs for different dataset sizes: $N = 2^6$ (left), $2^{12}$ (middle), $2^{16}$ (right). The bottom figure visualizes the generations when $N = 2^{12}$. The top row shows samples from the underlying distribution $\bm \Phi_{p_{\mathtt{data}}}(\bm{x}_T)$, while the middle and bottom rows display outputs from the trained diffusion model $\bm \Phi_{p_{\bm \theta}}(\bm{x}_T)$ at epochs 85 and 500, respectively.
  • Figure 5: Bias–Variance Trade-off. (a) plots the generalization error $\mathcal{E}_\mathtt{gen}$, bias $\mathcal{E}_\mathtt{bias}$, and variance $\mathcal{E}_\mathtt{var}$ across different network architectures with a fixed training sample size of $N = 2^{16}$. (b) shows $\mathcal{E}_\mathtt{bias}$ and $\mathcal{E}_\mathtt{var}$ as functions of the number of training samples $N$ for various network architectures.
  • ...and 9 more figures
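
The rescaled axis in Figure 3, $\log_2(N/\sqrt{|\bm \theta|})$, can be made concrete with a short calculation. This is a minimal sketch: the dataset sizes and parameter counts below are hypothetical round numbers chosen so the arithmetic is exact, not the sizes of U-Net-1 to U-Net-10, and `log2_scaling_ratio` is an illustrative helper name.

```python
import math


def log2_scaling_ratio(n_train, n_params):
    """Return log2(N / sqrt(|theta|)) for N training samples and |theta| parameters."""
    return math.log2(n_train / math.sqrt(n_params))


# Hypothetical configurations (not taken from the paper):
small_model = log2_scaling_ratio(n_train=2**12, n_params=2**20)
large_model = log2_scaling_ratio(n_train=2**14, n_params=2**24)

# 16x more parameters combined with 4x more data gives the same ratio.
print(small_model, large_model)
```

Under this ratio, a model with 16 times as many parameters sits at the same point on the horizontal axis once it is given 4 times as much training data, which is the kind of alignment across architectures that the right panel of Figure 3 is described as showing.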

Theorems & Definitions (13)

  • Definition 1: Probability flow distance ($\textup{PFD}$)
  • Theorem 1
  • Theorem 2
  • Definition 2: Generalization and Memorization Errors
  • Definition 3: Bias-Variance Decomposition of $\mathcal{E}_\mathtt{gen}$
  • Proof of Theorem 1 (`thm:properties`)
  • Lemma 1
  • Proof of Lemma 1 (`lemma:lip_diff_n2i_mapping`)
  • Proof of Theorem 2 (`thm:empirical_approximation`)
  • Example 1
  • ...and 3 more