Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models

Shinnosuke Saito, Takashi Matsubara

TL;DR

This work tackles the lack of an explicit latent space in diffusion models by introducing a Riemannian metric on the diffusion noise space, defined through the Jacobian $J_{x_t}$ of the score function: $g_{x_t}(v,w) = v^T J_{x_t}^T J_{x_t} w$. By favoring geodesics that stay within or run parallel to the learned data manifold, the approach yields manifold-aware interpolations that preserve endpoint semantics. The authors claim three contributions: (i) a practical metric on the noise space that requires no retraining; (ii) a geodesic-based interpolation framework aligned with the data manifold; and (iii) empirical validation on synthetic 2D data, image interpolation, and video-frame interpolation, where the method outperforms density-based and naive baselines in perceptual quality and detail preservation. The method has the potential to improve manifold-aware analysis and editing of diffusion-model outputs, with implications for interpolation, inversion, and editing tasks.

Abstract

Diffusion models are powerful deep generative models (DGMs) that generate high-fidelity, diverse content. However, unlike classical DGMs, they lack an explicit, tractable low-dimensional latent space that parameterizes the data manifold. This absence limits manifold-aware analysis and operations, such as interpolation and editing. Existing interpolation methods for diffusion models typically follow paths through high-density regions, which are not necessarily aligned with the data manifold and can yield perceptually unnatural transitions. To exploit the data manifold learned by diffusion models, we propose a novel Riemannian metric on the noise space, inspired by recent findings that the Jacobian of the score function captures the tangent spaces to the local data manifold. This metric encourages geodesics in the noise space to stay within or run parallel to the learned data manifold. Experiments on image interpolation show that our metric produces perceptually more natural and faithful transitions than existing density-based and naive baselines.
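
To make the idea of a score-Jacobian metric concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hand-written 2D ring-shaped density in place of a trained diffusion model, estimates the Jacobian of its score by finite differences, and compares the discrete energy of two paths between the same endpoints under $g_x(v,w) = v^T J_x^T J_x w$. The names `SIGMA`, `score`, `jacobian`, `metric`, and `path_energy` are hypothetical and introduced only for illustration.

```python
import math

SIGMA = 0.1  # assumed thickness of the toy ring-shaped density (not from the paper)

def score(x):
    """Score (gradient of log-density) of the toy density
    p(x) proportional to exp(-(|x| - 1)^2 / (2 * SIGMA^2)) on R^2."""
    r = math.hypot(x[0], x[1])
    f = -(r - 1.0) / SIGMA**2
    return [f * x[0] / r, f * x[1] / r]

def jacobian(x, h=1e-5):
    """Central finite-difference Jacobian of the score: J[i][j] = d s_i / d x_j."""
    J = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        sp, sm = score(xp), score(xm)
        for i in range(2):
            J[i][j] = (sp[i] - sm[i]) / (2.0 * h)
    return J

def metric(x, v, w):
    """g_x(v, w) = v^T J^T J w, computed as (J v) . (J w)."""
    J = jacobian(x)
    Jv = [J[0][0] * v[0] + J[0][1] * v[1], J[1][0] * v[0] + J[1][1] * v[1]]
    Jw = [J[0][0] * w[0] + J[0][1] * w[1], J[1][0] * w[0] + J[1][1] * w[1]]
    return Jv[0] * Jw[0] + Jv[1] * Jw[1]

def path_energy(points):
    """Discrete curve energy: n * sum of g_mid(delta, delta) over the n segments,
    with the metric evaluated at each segment midpoint."""
    n = len(points) - 1
    total = 0.0
    for a, b in zip(points, points[1:]):
        mid = [(a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0]
        delta = [b[0] - a[0], b[1] - a[1]]
        total += metric(mid, delta, delta)
    return n * total

N = 64  # number of segments per path
# Path 1: follows the ring (the toy "manifold") from (1, 0) to (0, 1).
arc = [[math.cos(0.5 * math.pi * k / N), math.sin(0.5 * math.pi * k / N)]
       for k in range(N + 1)]
# Path 2: straight line (LERP) between the same endpoints, cutting inside the ring.
chord = [[1.0 - k / N, k / N] for k in range(N + 1)]

arc_energy = path_energy(arc)
chord_energy = path_energy(chord)
print(f"arc energy:   {arc_energy:.6g}")
print(f"chord energy: {chord_energy:.6g}")
```

In this toy setting the path that follows the ring has an energy orders of magnitude below that of the straight chord, because the score changes little along the ring and sharply across it. A geodesic under such a metric is therefore pulled toward paths that run along the high-density set, which is the qualitative behavior the abstract describes.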

Paper Structure

This paper contains 52 sections, 1 theorem, 22 equations, 7 figures, 4 tables.

Key Result

Proposition 1

Minimizing $\|J_{x_t} v\|^2_2$ with respect to a vector $v$ of a fixed Euclidean norm encourages the vector $v$ to lie in the tangent space ${\mathcal{T}}_{x}{\mathcal{M}}_t$.
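
A short worked derivation may help make the proposition concrete. It is a standard Rayleigh-quotient argument and not necessarily the authors' proof; the eigenvalues $\lambda_i$, eigenvectors $u_i$, and norm $c$ are notation introduced here for illustration.

$$
\begin{aligned}
\|J_{x_t} v\|_2^2 &= v^T J_{x_t}^T J_{x_t} v, \\
J_{x_t}^T J_{x_t} &= \sum_i \lambda_i u_i u_i^T, \qquad 0 \le \lambda_1 \le \lambda_2 \le \cdots, \\
\min_{\|v\|_2 = c} \|J_{x_t} v\|_2^2 &= c^2 \lambda_1, \qquad \text{attained at } v = \pm c\, u_1 .
\end{aligned}
$$

The minimizer therefore concentrates on the directions along which the score changes least, that is, the right singular vectors of $J_{x_t}$ with the smallest singular values. If these directions span the tangent space ${\mathcal{T}}_{x}{\mathcal{M}}_t$, as the proposition states, then vectors that are short under this quadratic form are the tangential ones.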

Figures (7)

  • Figure 1: A conceptual comparison of interpolation methods. (left) Interpolation paths on a C-shaped distribution. (middle) The probability density along each interpolation path. (right) Examples of image interpolation on Animal Faces-HQ (AF) [Choi2022b]. LERP cuts through a low-density region, yielding unnatural transitions. SLERP deviates from the manifold, sometimes losing fine textures (see the background in the right panel). Density-based interpolation approaches and traverses a high-density region, which does not preserve the probabilities of the endpoints and sometimes produces over-smoothed images. Ours runs parallel to the manifold, preserving the probabilities of the endpoints and yielding natural transitions. See the experiments section for details.
  • Figure 2: Qualitative examples of interpolated image sequences for AF (Dog). The images at both ends are the given endpoints $x_0^{(0)}$ and $x_0^{(1)}$, and the middle images are the interpolated results $\{\hat{x}_0^{(u)}\}$ for $u\in[0,1]$. See also the additional qualitative results in the appendix.
  • Figure 4: Qualitative examples of video frame interpolation. See also the additional video results in the appendix.
  • Figure 5: Examples of interpolated image sequences. The leftmost and rightmost images are the given endpoints $x_0^{(0)}$ and $x_0^{(1)}$, and the middle images are the interpolated results $\{\hat{x}_0^{(u)}\}$ for $u\in[0,1]$.
  • Figure 6: Qualitative examples of video frame interpolation.
  • ...and 2 more figures
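
For readers unfamiliar with the two naive baselines named in the Figure 1 caption, the sketch below shows the textbook definitions of linear interpolation (LERP) and spherical linear interpolation (SLERP) on plain Python lists. It is a generic illustration under the assumption that noise vectors are represented as flat lists of floats; it is not taken from the paper's code, and the helper names are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))

def lerp(a, b, u):
    """Linear interpolation: (1 - u) * a + u * b."""
    return [(1.0 - u) * ai + u * bi for ai, bi in zip(a, b)]

def slerp(a, b, u):
    """Spherical linear interpolation along the great arc between a and b.
    Falls back to LERP when the vectors are (anti)parallel, where the
    formula divides by sin(omega) close to zero."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    cos_omega = max(-1.0, min(1.0, dot / (norm(a) * norm(b))))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        return lerp(a, b, u)
    wa = math.sin((1.0 - u) * omega) / sin_omega
    wb = math.sin(u * omega) / sin_omega
    return [wa * ai + wb * bi for ai, bi in zip(a, b)]

# Two orthogonal unit vectors standing in for two noise samples.
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
mid_lerp = lerp(a, b, 0.5)
mid_slerp = slerp(a, b, 0.5)
print("LERP midpoint norm: ", norm(mid_lerp))
print("SLERP midpoint norm:", norm(mid_slerp))
```

For endpoints of equal norm, SLERP keeps the norm constant along the path while LERP shrinks it toward the middle. Neither baseline uses any information about the learned data manifold, which is the gap the proposed metric is meant to address.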

Theorems & Definitions (1)

  • Proposition 1