Analysis of Diffusion Models for Manifold Data
Anand Jerry George, Rodrigo Veiga, Nicolas Macris
TL;DR
We address how time-reversed diffusion with an empirical score behaves when data lie on a $p$-dimensional manifold embedded in a high-dimensional ambient space of dimension $d$. Our approach builds a tractable manifold data model and leverages the exact mutual information of generalized linear models to characterize the dynamical transitions, deriving explicit formulas for the speciation and collapse times in the limit $d\to\infty$ with $p=\beta d$ and $n=e^{\alpha d}$ samples. Key contributions include $t_S \approx \frac{1}{2}\log\left[2\left(\varrho_1^2 \beta d \|\tilde{\mu}\|^2 + \varrho_*^2\right)\right]$ for odd activations with opposite centers, and $t_C = \frac{1}{2}\log\left(1+\left(e^{2\alpha/\beta}-1\right)^{-1}\right)$ in the linear-manifold case, with general manifolds handled via replica-symmetric (RS) free-energy and random energy model (REM) arguments. The results show that manifold structure can significantly reduce the speciation and collapse times, implying that exponentially many samples, $O(e^{p})$, are needed to keep the dynamical times finite, and they connect diffusion dynamics to phase transitions in generalized linear models. This framework provides theoretical insight into diffusion-based generation on structured data and informs considerations of memory and sample efficiency for high-dimensional data with low-dimensional structure.
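To make the two closed-form expressions above concrete, here is a minimal Python sketch that evaluates them numerically. The function and argument names (`collapse_time`, `speciation_time`, `rho_1`, `rho_star`, `mu_norm_sq`) are illustrative choices, not taken from the paper, and $\varrho_1$, $\varrho_*$ and $\|\tilde{\mu}\|^2$ are treated as plain numerical inputs whose definitions are given in the full text. The collapse time is computed through the algebraically equivalent form $t_C = -\frac{1}{2}\log\left(1-e^{-2\alpha/\beta}\right)$, which follows from $1+\left(e^{x}-1\right)^{-1} = \left(1-e^{-x}\right)^{-1}$.

```python
import math


def collapse_time(alpha: float, beta: float) -> float:
    """Collapse time t_C for the linear-manifold case.

    Evaluates t_C = 1/2 * log(1 + 1/(exp(2*alpha/beta) - 1)) through the
    equivalent form -1/2 * log(1 - exp(-2*alpha/beta)), which avoids
    overflow of exp(2*alpha/beta) when alpha/beta is large.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    x = 2.0 * alpha / beta
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -0.5 * math.log(-math.expm1(-x))


def speciation_time(rho_1: float, rho_star: float, beta: float,
                    d: float, mu_norm_sq: float) -> float:
    """Approximate speciation time t_S (odd activation, opposite centers).

    rho_1, rho_star and mu_norm_sq stand for the symbols varrho_1,
    varrho_* and ||mu_tilde||^2 in the formula; they are assumed to be
    supplied by the caller.
    """
    inner = 2.0 * (rho_1 ** 2 * beta * d * mu_norm_sq + rho_star ** 2)
    if inner <= 0:
        raise ValueError("argument of the logarithm must be positive")
    return 0.5 * math.log(inner)


if __name__ == "__main__":
    # t_C depends on alpha and beta only through the ratio alpha / beta.
    for ratio in (0.1, 0.5, 1.0, 2.0):
        print(f"alpha/beta = {ratio}: t_C = {collapse_time(ratio, 1.0):.4f}")
```

Since $t_C$ depends on $\alpha$ and $\beta$ only through $\alpha/\beta$, holding $t_C$ fixed means holding $\alpha/\beta$ fixed, and then $n = e^{\alpha d} = e^{(\alpha/\beta)p}$ grows exponentially in the manifold dimension $p$.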
Abstract
We analyze the time-reversed dynamics of generative diffusion models. If the exact empirical score function is used in a regime of large dimension and an exponentially large number of samples, these models are known to undergo transitions between distinct dynamical regimes. We extend this analysis and compute the transitions for an analytically tractable manifold model in which the data are distributed as a mixture of lower-dimensional Gaussians embedded in a higher-dimensional space. We compute the so-called speciation and collapse transition times as functions of the ratio of manifold to ambient-space dimensions and of other characteristics of the data model. An important tool in our analysis is the exact formula for the mutual information (or free energy) of Generalized Linear Models.
