Your diffusion model secretly knows the dimension of the data manifold

Jan Stanczuk, Georgios Batzolis, Teo Deveney, Carola-Bibiane Schönlieb

TL;DR

This work estimates the intrinsic dimension of a dataset from the scores of a trained diffusion model evaluated near the data manifold. It shows that for small diffusion times the score lies approximately in the normal space of the manifold, which enables a practical SVD-based estimator: score vectors sampled around a point are stacked into a matrix, and the number of (almost) vanishing singular values gives the manifold's intrinsic dimension. Experiments on Euclidean and image manifolds, including MNIST, show higher accuracy than traditional estimators such as MLE, Local PCA, and PPCA, and provide new estimates of the intrinsic dimension of MNIST. The study highlights diffusion models as a tool not only for generation but also for uncovering the geometric structure underlying data.

Abstract

In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. We prove that, if the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first estimator of the data manifold dimension based on diffusion models, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.
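
The estimator described in the abstract can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions, not the authors' implementation: a closed-form score replaces the trained network, the data distribution is a standard Gaussian supported on a $k$-dimensional linear subspace of $\mathbb{R}^d$, and the "vanishing" singular values are identified with a relative threshold. The names `gaussian_subspace_score`, `estimate_dimension`, and `rel_tol` are hypothetical and chosen only for this sketch.

```python
import numpy as np


def gaussian_subspace_score(x, sigma, k):
    """Exact score of N(0, diag(1_k, 0_{d-k})) convolved with N(0, sigma^2 I).

    Assumption: this closed form stands in for a trained score network.
    """
    var = np.full(x.shape[-1], sigma**2)
    var[:k] += 1.0  # the first k coordinates carry the data variance
    return -x / var


def estimate_dimension(score_fn, x0, sigma, num_samples, rel_tol, rng):
    """Count the (almost) vanishing singular values of stacked score vectors."""
    d = x0.shape[0]
    # Sample points in a small neighbourhood of x0 by adding Gaussian noise.
    x_noisy = x0 + sigma * rng.standard_normal((num_samples, d))
    scores = score_fn(x_noisy)  # shape (num_samples, d)
    # Singular values are returned in descending order.
    sing = np.linalg.svd(scores, compute_uv=False)
    # Assumption: a relative threshold decides which values count as vanishing.
    vanishing = sing < rel_tol * sing[0]
    return int(vanishing.sum()), sing


rng = np.random.default_rng(0)
d, k, sigma = 10, 3, 1e-2
x0 = np.zeros(d)
x0[:k] = [1.0, -0.5, 0.3]  # a point on the 3-dimensional subspace

dim, sing = estimate_dimension(
    lambda x: gaussian_subspace_score(x, sigma, k),
    x0, sigma, num_samples=1000, rel_tol=0.1, rng=rng,
)
print("estimated dimension:", dim)
print("singular values:", np.round(sing, 2))
```

In this toy setting the score components normal to the subspace scale like $1/\sigma$, while the tangential components stay bounded, so the spectrum splits into $d-k$ large values and $k$ small ones. With a learned score the gap is typically less clean, and the choice of threshold or gap criterion matters more.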

Paper Structure (26 sections, 8 theorems, 39 equations, 18 figures, 4 tables, 1 algorithm)

Key Result

Theorem 5.1

Suppose that the support of the data distribution $P_0$ is contained in a compact embedded sub-manifold $\mathcal{M} \subseteq \mathbb{R}^d$ and let $P_t$ be the distribution of samples from $P_0$ diffused for time $t$. Then, under mild assumptions, for any point $\textup{x} \in \mathbb{R}^d$ su where $\text{S}_{\cos}$ denotes the cosine similarity. In other words, for sufficiently small $t$ t
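
The theorem statement above is truncated, so the following sketch illustrates only the qualitative claim stated in the abstract: as the corruption level decreases, the score aligns with the direction pointing towards the manifold. It is a toy check under assumptions that are not from the paper: the manifold is the first coordinate axis of $\mathbb{R}^2$ (a line, which is not compact), the data distribution on it is a standard Gaussian, and the noise level `sigma` plays the role of the diffusion time. The helper names `score_line_gaussian` and `cosine_similarity` are hypothetical.

```python
import numpy as np


def score_line_gaussian(x, sigma):
    """Exact score when P_0 = N(0, 1) on the first axis of R^2.

    After adding N(0, sigma^2 I) noise the density is
    N(0, diag(1 + sigma^2, sigma^2)), so the score is -x / variance.
    """
    var = np.array([1.0 + sigma**2, sigma**2])
    return -x / var


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


x = np.array([1.0, 0.5])                  # a point off the manifold
to_manifold = np.array([x[0], 0.0]) - x   # orthogonal projection minus the point

sigmas = [1.0, 0.3, 0.1, 0.03]
cosines = [
    cosine_similarity(to_manifold, score_line_gaussian(x, s)) for s in sigmas
]
for s, c in zip(sigmas, cosines):
    print(f"sigma = {s:<5} cosine similarity = {c:.6f}")
```

At large noise the score also has a sizeable tangential component, caused by the non-uniform density along the line. As `sigma` shrinks, the normal component grows like $1/\sigma^2$ and dominates, so the cosine similarity approaches 1.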

Figures (18)

  • Figure 1: The data manifold (in blue) and the neural approximation of the score field $\nabla_\textbf{x} \ln p_{t_0}(\textbf{x})$ obtained from a diffusion model. Near the manifold the score field is perpendicular to the manifold surface.
  • Figure 2: The red dot shows a point $\textbf{x}_0$ on the data manifold where we wish to estimate the dimension. We sample $K$ blue points $\textbf{x}_t^{(i)}$ in a close neighbourhood of the red point and evaluate the score field. The resulting vectors $s_\theta(\textbf{x}_\epsilon^{(i)}, \epsilon)$ will point in the normal direction. We put the vectors into a matrix and perform SVD to detect the dimension of the normal space. The dimension of the manifold will be equal to the number of (almost) vanishing singular values.
  • Figure 3: Singular values for the scores of $k$-sphere for $k=10, 50$. In both cases around $k$ singular values almost vanish, clearly indicating the dimensionality of the manifold. Each line shows a score spectrum at different $\textbf{x}_0^{(j)}$.
  • Figure 4: Auto-encoder reconstruction error on MNIST for different latent space dimensions. Vertical lines mark different estimations of intrinsic dimension.
  • Figure 5: MNIST score spectra that yielded the highest estimated dimension for each digit.
  • ...and 13 more figures

Theorems & Definitions (13)

  • Theorem 5.1
  • Corollary 5.2
  • Theorem D.1
  • Definition D.2
  • Definition D.3
  • Theorem D.4
  • Lemma D.5
  • proof
  • Theorem D.6
  • Definition D.7
  • ...and 3 more