Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu

TL;DR

The paper addresses why diffusion models can learn realistic distributions from relatively few samples by modeling data as a union of low-dimensional subspaces via a mixture of low-rank Gaussians (MoLRG) and parameterizing the denoising autoencoder accordingly. It establishes a theoretical equivalence between diffusion-model training and subspace clustering, proving that the minimal sample complexity scales linearly with the intrinsic dimension under MoLRG assumptions, and identifies a phase transition in learnability. The work further demonstrates a practical link between the learned subspaces and semantic attributes, enabling editing operations, and provides empirical validation on simulated MoLRG data and real image datasets. These results offer a principled explanation for the observed data-efficient learning of diffusion models and suggest avenues for improving generalization and editing capabilities in practice.

Abstract

Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.
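
To make the data assumption concrete, here is a minimal NumPy sketch of sampling from a mixture of low-rank Gaussians. It assumes zero-mean components with covariance $U_k U_k^\top$ for orthonormal $n \times d$ bases $U_k$, equal subspace dimensions, and uniform mixing weights. These specifics and the helper name `sample_molrg` are illustrative assumptions, not necessarily the paper's exact Definition 1. The values $n = 48$ and $K = 2$ mirror the synthetic setup mentioned in the Figure 4 caption.

```python
import numpy as np

def sample_molrg(n, d, K, N, rng):
    """Draw N points in R^n from an assumed mixture of K low-rank Gaussians.

    Assumption: component k is N(0, U_k U_k^T) with U_k an n x d orthonormal
    basis, and every component is equally likely.
    """
    # Random orthonormal bases from the reduced QR of Gaussian matrices.
    bases = [np.linalg.qr(rng.standard_normal((n, d)))[0] for _ in range(K)]
    labels = rng.integers(0, K, size=N)      # which subspace each sample uses
    coeffs = rng.standard_normal((N, d))     # d-dimensional Gaussian coefficients
    X = np.stack([bases[k] @ a for k, a in zip(labels, coeffs)])
    return X, labels, bases

rng = np.random.default_rng(0)
X, labels, bases = sample_molrg(n=48, d=4, K=2, N=200, rng=rng)

# Each cluster spans only d directions even though the ambient dimension is n.
for k in range(2):
    print(k, np.linalg.matrix_rank(X[labels == k]))
```

Every sample lies exactly in one of the $K$ subspaces, which is why recovering the distribution under this assumption amounts to recovering the subspaces.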

Paper Structure (45 sections, 11 theorems, 94 equations, 8 figures, 2 tables, 1 algorithm)


Key Result

Theorem 1

Suppose that the DAE $\bm x_{\bm \theta}(\cdot,t)$ in the training problem (eq:em loss) is parameterized as in (eq:para Gau) for each $t \in [0,1]$. Then Problem (eq:em loss) is equivalent to the principal component analysis (PCA) problem (eq:PCA).
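
The PCA problem itself is not reproduced on this page. As a rough illustration of the single-subspace case, the sketch below assumes the standard PCA formulation (take the top-$d$ left singular vectors of the data matrix) and noiseless samples drawn from one $d$-dimensional subspace. The function names and the error metric are hypothetical choices for illustration. It shows the subspace being recovered once the number of samples $N$ reaches $d$, and not before, consistent with the linear sample-complexity claim.

```python
import numpy as np

def pca_subspace(X, d):
    """Top-d left singular vectors of the n x N data matrix (classical PCA via SVD)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :d]  # has fewer than d columns when N < d

def subspace_error(U_hat, U_true):
    """Frobenius distance between the orthogonal projectors onto the two subspaces."""
    return np.linalg.norm(U_hat @ U_hat.T - U_true @ U_true.T)

rng = np.random.default_rng(0)
n, d = 48, 6
U_true, _ = np.linalg.qr(rng.standard_normal((n, d)))  # ground-truth basis

errors = {}
for N in (3, 6, 12):
    # Columns of X are samples U_true @ a with a ~ N(0, I_d).
    X = U_true @ rng.standard_normal((d, N))
    errors[N] = subspace_error(pca_subspace(X, d), U_true)
    print(f"N={N:2d}  projector error={errors[N]:.2e}")
```

With fewer than $d$ samples the data cannot span the subspace, so the error stays large; with $N \ge d$ generic samples the error drops to numerical precision.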

Figures (8)

  • Figure 1: Comparison of images generated from the Gaussian, MoLRG, and the distribution learned by diffusion models across different datasets. Each row displays images generated from different distributions using the reverse-time ODE sampler, including the Gaussian, MoLRG, and the distribution learned by U-Net. The columns represent images generated from the same initial noise. The results are shown for four datasets: FashionMNIST (top left), MNIST (top right), CIFAR-10 (bottom left), and FFHQ (bottom right).
  • Figure 2: Phase transition of learning the MoLRG distribution with $K = 1$. The $x$-axis is the number of training samples and the $y$-axis is the dimension of subspaces. Darker pixels represent a lower empirical probability of success. We apply SVD and stochastic gradient descent to solve Problems (eq:PCA) and (eq:em loss), visualizing the results in (a) and (b), respectively.
  • Figure 3: Phase transition of learning the MoLRG distribution with $K = 2$. The $x$-axis is the number of training samples and the $y$-axis is the dimension of subspaces. Darker pixels represent a lower empirical probability of success. We apply a subspace clustering method and stochastic gradient descent to solve Problems (eq:SC) and (eq:em loss), visualizing the results in (a) and (b), respectively. Additional experiments for the case $K = 3$ are presented in (fig:phase-transition-MoG-add-exp).
  • Figure 4: Phase transition of generalization using U-Net. Diffusion models with a U-Net architecture are trained on synthetic data sampled from the MoLRG distribution (left column; $K = 2$, $n = 48$, varying intrinsic dimensions) and on real image datasets: CIFAR-10, CelebA, FFHQ, and AFHQ (right column). The GL score is plotted against the ratio of training samples to the intrinsic dimension (top row) and to the square of the intrinsic dimension (bottom row). In each panel, a black dashed line fits the data across the different intrinsic dimensions (or datasets). A GL score above 0.95 (within the dark grey region) indicates good generalization, while a score below 0.95 (within the light grey region) indicates poor generalization.
  • Figure 5: Correspondence between the singular vectors of the Jacobian of the DAE and semantic image attributes. We use a pre-trained DDPM with U-Net on the MetFaces dataset. We edit the original image $\bm x_0$ by changing $\bm x_t$ into $\bm x_t + \alpha \bm q_i$, where $\bm q_i$ is a singular vector of the Jacobian of the DAE $\bm x_{\bm \theta}(\bm x_t,t)$.
  • ...and 3 more figures
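
The editing rule in the Figure 5 caption, $\bm x_t \to \bm x_t + \alpha \bm q_i$, can be sketched on a toy model. The code below is not the pre-trained U-Net pipeline: `toy_dae` is a hypothetical stand-in equal to the posterior mean under a single low-rank Gaussian, the scale `s` and noise level `g` are arbitrary, the Jacobian is taken by finite differences, and using the right singular vectors as edit directions is an assumption, since the caption does not say which side is meant.

```python
import numpy as np

def toy_dae(x_t, U, s=0.8, g=0.6):
    """Hypothetical stand-in for a trained DAE.

    For x_0 = U a with a ~ N(0, I) and x_t = s * x_0 + g * noise, the posterior
    mean E[x_0 | x_t] is (s / (s^2 + g^2)) * U U^T x_t.
    """
    return (s / (s**2 + g**2)) * (U @ (U.T @ x_t))

def numerical_jacobian(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (column j is df/dx_j)."""
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n, d = 16, 3
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
x_t = rng.standard_normal(n)

J = numerical_jacobian(lambda x: toy_dae(x, U), x_t)
_, sing_vals, Vt = np.linalg.svd(J)

q = Vt[0]                 # leading right singular vector (assumed edit direction)
alpha = 2.0               # edit strength
x_edit = x_t + alpha * q
change = toy_dae(x_edit, U) - toy_dae(x_t, U)

print("numerical rank of Jacobian:", int(np.sum(sing_vals > 1e-6)))
print("norm of change in denoised output:", np.linalg.norm(change))
```

In this toy the Jacobian has rank $d$, so only $d$ directions in input space alter the denoised output at all. For a real U-Net the Jacobian would be obtained with automatic differentiation rather than finite differences.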

Theorems & Definitions (21)

  • Definition 1: Mixture of Low-Rank Gaussians
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • proof
  • ...and 11 more