What Secrets Do Your Manifolds Hold? Understanding the Local Geometry of Generative Models

Ahmed Imtiaz Humayun, Ibtihel Amara, Cristina Vasconcelos, Deepak Ramachandran, Candice Schumann, Junfeng He, Katherine Heller, Golnoosh Farnadi, Negar Rostamzadeh, Mohammad Havaei

TL;DR

This work analyzes the local geometry of pre-trained generative models by adopting continuous piecewise-linear (CPWL) theory and defining three descriptors: local scaling $\psi_\omega$, local rank $\nu_\omega$, and local complexity $\delta_z$. It demonstrates that these descriptors correlate with downstream generation aspects such as aesthetics, diversity, and memorization across diffusion-based architectures, including DDPM, Diffusion Transformer, and Stable Diffusion, and distinguishes on-manifold from off-manifold regions. The authors also show that training a reward model on local scaling enables geometry-guided denoising to boost texture, diversity, and perceptual quality at the instance level. Overall, the study links the learned manifold’s geometry to generation outcomes and proposes geometry-driven tools for OOD detection and targeted guidance, while acknowledging computational costs and dependence on training dynamics.

Abstract

Deep generative models are frequently used to learn continuous representations of complex data distributions from a finite number of samples. For any generative model, including pre-trained foundation models with Diffusion or Transformer architectures, generation performance can vary significantly across the learned data manifold. In this paper we study the local geometry of the learned manifold and its relationship to generation outcomes for a wide range of generative models, including DDPM, Diffusion Transformer (DiT), and Stable Diffusion 1.4. Building on the theory of continuous piecewise-linear (CPWL) generators, we characterize the local geometry in terms of three geometric descriptors: scaling ($\psi$), rank ($\nu$), and complexity/un-smoothness ($\delta$). We provide quantitative and qualitative evidence showing that, for a given latent-image pair, the local descriptors are indicative of the generation aesthetics, diversity, and memorization of the generative model. Finally, we demonstrate that by training a reward model on the local scaling for Stable Diffusion, we can self-improve both generation aesthetics and diversity using "geometry reward"-based guidance during denoising.
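To make the descriptors concrete, here is a minimal sketch on a toy ReLU network, which is a CPWL generator. The abstract does not spell out how the descriptors are defined, so the definitions below are assumptions made for illustration: local scaling is taken as the sum of the log singular values of the generator's Jacobian on the linear region containing $z$, and local rank as the number of singular values above a tolerance. The function names, the tolerance `tol`, and the network sizes are hypothetical. The complexity descriptor $\delta$ is not covered here.

```python
import numpy as np

def relu_generator(z, weights, biases):
    """Toy CPWL generator: ReLU hidden layers followed by a linear output layer."""
    h = z
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

def local_jacobian(z, weights, biases):
    """Exact Jacobian at z, i.e. the slope of the linear region that contains z."""
    h = z
    J = np.eye(z.shape[0])
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        mask = (pre > 0).astype(float)   # active ReLU units define the region
        J = (mask[:, None] * W) @ J      # chain rule through this layer
        h = np.maximum(pre, 0.0)
    return weights[-1] @ J

def local_scaling_and_rank(J, tol=1e-8):
    """Assumed definitions: log-volume scaling and numerical rank of J."""
    s = np.linalg.svd(J, compute_uv=False)
    nonzero = s[s > tol]
    scaling = float(np.sum(np.log(nonzero)))  # log of product of nonzero singular values
    rank = int(nonzero.size)
    return scaling, rank

# Hypothetical toy generator from R^2 to R^3 with random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 2)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))]
biases = [rng.normal(size=8), rng.normal(size=8), rng.normal(size=3)]

z = rng.normal(size=2)
psi, nu = local_scaling_and_rank(local_jacobian(z, weights, biases))
print(f"local scaling (assumed log-volume): {psi:.3f}, local rank: {nu}")
```

Because the network is piecewise linear, both values are constant on each linear region and change only when $z$ crosses a region boundary. For a large model the Jacobian would normally be obtained with automatic differentiation rather than the explicit chain rule used here.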

Paper Structure (27 sections, 5 equations, 32 figures)

Figures (32)

  • Figure 1: Controlling visual complexity using geometry guidance. We train a reward model on the geometric descriptor local scaling, computed for the decoder of Stable Diffusion (Rombach et al., 2021). Using guidance $\rho$ to increase the local scaling of the denoised samples, i.e., volume dilation (top row), increases the visual complexity of the generated samples; decreasing it, i.e., volume contraction (bottom row), reduces visual complexity. As local scaling increases, more background elements come into view and the focus on the subject decreases; the reverse holds when local scaling decreases.
  • Figure 2: The geometry of a continuous piecewise-linear toy generator. For a CPWL generator $\mathcal{G}:\mathbb{R}^2 \rightarrow \mathbb{R}^3$, we provide analytically computed visualizations of the input-space partition, i.e., the arrangement of linear regions (left), and of the learned CPWL manifold (middle-left). In this example, each piece is colored by the piecewise-constant scaling induced by $\mathcal{G}$, which is also computed analytically. Uniform samples from the latent domain (middle-right) and the corresponding generated samples (right) are colored by the density estimated at each sample with a Gaussian kernel density estimator in $\mathbb{R}^3$. For any sample $z\in\omega$, the estimated density ($\uparrow$ green) is inversely proportional to the scaling ($\downarrow$ green) of region $\omega$.
  • Figure 3: Local geometric descriptors computed over the input domain of a DDPM trained on samples from a toy dinosaur manifold $\mathcal{M} \subset \mathbb{R}^2$, conditioned on $t=0.22T$ (left three columns). Denoising dynamics of the local descriptors for different numbers of training optimization steps (right three columns). The x-axis shows the noise level $t$, running from $T$ to $0$.
  • Figure 4: Geometry of the Stable Diffusion latent space. Geometric descriptors (left, middle-left, middle-right) visualized on a 2D subspace of the latent space that passes through the latent representations of "a fox", "a cat", and "a dog" (right), which are marked on each descriptor plot. In the Appendix, we provide denoised images for different high- and low-descriptor regions of the subspace. Within the convex hull of the three anchor latent vectors, $\psi$ is higher, $\nu$ is lower, and $\delta$ is higher. Moreover, within the convex hull the local rank $\nu$ undergoes sharp changes that are not visible towards the edges of the domain.
  • Figure 5: Local geometry level sets for ImageNet prompts. Vendi diversity scores and RAHF (Liang et al., 2024) aesthetic scores computed for images generated with classifier-free guidance (CFG) scales of 7, 5, and 3. Diversity per level set first increases and then decreases as local scaling increases. The aesthetic score likewise increases slightly and then decreases as local scaling increases.
  • ...and 27 more figures
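
The inverse relation between density and scaling noted in the Figure 2 caption can be motivated by a change of variables. The following is a sketch under assumptions not stated in the captions: on region $\omega$ the generator is written $\mathcal{G}(z) = A_\omega z + b_\omega$ (the symbols $A_\omega$ and $b_\omega$ are introduced here for illustration), $A_\omega$ has full column rank, $\mathcal{G}$ is injective, and the latent density $p_z$ is uniform.

$$
p_x\big(\mathcal{G}(z)\big) = \frac{p_z(z)}{\sqrt{\det\left(A_\omega^\top A_\omega\right)}}, \qquad z \in \omega .
$$

Since $p_z$ is constant, the density on each piece of the manifold is inversely proportional to the volume scaling factor $\sqrt{\det(A_\omega^\top A_\omega)}$, which equals the product of the singular values of $A_\omega$.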