Understanding Latent Diffusability via Fisher Geometry

Jing Gu, Morteza Mardani, Wonjun Lee, Dongmian Zou, Gilad Lerman

Abstract

Diffusion models often degrade when trained in latent spaces (e.g., VAEs), yet the formal causes remain poorly understood. We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder's local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability. Experiments across diverse autoencoding architectures validate our framework and establish these efficient FI and FIR metrics as a robust diagnostic suite for identifying and mitigating latent diffusion failure.
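As background for the decomposition the abstract describes, the standard Gaussian-smoothing identities (Tweedie's formula and the I-MMSE relation) connect the MMSE rate to Fisher Information and its derivative. The sketch below uses our own notation ($D$ for ambient dimension, $\mathcal{R}$ for the FIR) and may differ in conventions from the paper's Section 2:

```latex
% For Y_\tau = X + \sqrt{\tau}\,N with N \sim \mathcal{N}(0, I_D), let p_\tau
% denote the density of \mu_\tau = \mu * \mathcal{N}(0, \tau I_D), and let
% \mathcal{I}(\mu_\tau) = \mathbb{E}\,\|\nabla \log p_\tau(Y_\tau)\|^2 be its
% Fisher information. Then
\mathrm{mmse}(\tau)
  = \mathbb{E}\,\bigl\|X - \mathbb{E}[X \mid Y_\tau]\bigr\|^2
  = \tau D - \tau^2\, \mathcal{I}(\mu_\tau),
% so differentiating in \tau decomposes the MMSE rate into an FI term and an
% FIR term \mathcal{R}(\mu_\tau) := \tfrac{d}{d\tau}\,\mathcal{I}(\mu_\tau):
\frac{d}{d\tau}\,\mathrm{mmse}(\tau)
  = D - 2\tau\,\mathcal{I}(\mu_\tau) - \tau^2\, \mathcal{R}(\mu_\tau).
```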

Paper Structure

This paper contains 71 sections, 6 theorems, 98 equations, 13 figures, 3 tables.

Key Result

Proposition 2.1

For every $\tau > 0$, …

Figures (13)

  • Figure 1: Geometric interpretation of encoder assumptions.
  • Figure 2: Values of $\mathcal{I}$ (left) and $\mathcal{R}$ (right) plotted versus the noise variance $\tau$, computed from tiny diffusion models trained on different data representations. Pixel curves correspond to models trained on $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_2)$. Latent curves correspond to models trained on encoded data $\mathbf{z}= E(\mathbf{x})$, where the pointwise activation $E$ is indicated in the legend; for Leaky ReLU, $\alpha$ denotes the negative slope.
  • Figure 3: FIR deviation $\mathcal{D}_{\mathcal{R}}$ vs. (a) $\delta_0$, (b) $d$, and (c) $\varepsilon_0$ in toy settings. Data $\mathbf{y}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_2)$ are embedded as $\mathbf{x}=(y_1, y_2,0,\ldots,0)\in\mathbb R^D$ and encoded to $\mathbf{z}=E(\mathbf{x})\in\mathbb R^d$. We compute $\mathcal{D}_{\mathcal{R}}$ from $\mathcal{R}^{(D)}(\mu_\tau)$ and $\mathcal{R}^{(d)}((\mu_Z)_\tau)$ using diffusion models trained on $\mathbf{x}$ and $\mathbf{z}$. Curves denote fixed noise variance $\tau$. Solid lines $y=1.25\delta_0$ (a) and $y=\frac{D-d}{D-2}$ (b) serve as linear references. Encoder $E$ setups: (a) $D=d=2$, $E(\mathbf{x})=\mathbf{A}\mathbf{x}$ with $\mathbf{A}=\mathrm{diag}(\sqrt{1+\delta_0},\sqrt{1-\delta_0})$. (b) $D=512$ with $\mathbf{z}=(y_1, y_2,0,\ldots,0) \in\mathbb R^d$. (c) $D=d=3$ with $E((x_1,x_2,0),\varepsilon_0)=\bigl(\sin(\varepsilon_0 x_1)/\varepsilon_0,\; x_2,\; (1-\cos(\varepsilon_0 x_1))/\varepsilon_0\bigr)$. (d) Real data (Sec. \ref{sec:exp_GPE}): $\mathcal{D}_{\mathcal{R}}^{\mathrm{sc}}$ vs. $\tau$ ($D=64\times64\times3$, $d=256$). FFHQ images $\mathbf{x}$ are encoded to latents $\mathbf{z}=E(\mathbf{x})$ via GPE or VAE (see legend).
  • Figure 4: Values of (a) $\mathcal{I}$ and (b) $\mathcal{R}$ plotted versus the noise variance $\tau$, computed from diffusion models trained on different data representations. The pixel curves correspond to models trained directly on FFHQ images. The latent curves correspond to models trained on latent representations of an image encoder (GPE or VAE) pretrained on FFHQ. We show $\sqrt{\tau} \in [0.01,80]$, excluding smaller $\tau$ due to numerical instability.
  • Figure 5: (a) Samples from a diffusion model trained directly on $64\times64$ FFHQ. (b, c) Samples from latent diffusion models trained on (b) GPE and (c) VAE representations. Generated latents are mapped back to image space using their respective decoders.
  • ...and 8 more figures
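The FI curves in Figure 2 come from trained diffusion models, but under the same toy assumption of Gaussian data $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_d)$, the Fisher information of the smoothed distribution has a closed form, $\mathcal{I}(\mu_\tau) = d/(1+\tau)$, which a Monte-Carlo estimate can verify. The sketch below (function names are ours, not the paper's) uses the exact score of the smoothed Gaussian rather than a learned one:

```python
import numpy as np

def fisher_info_mc(tau, d=2, n=200_000, seed=0):
    """Monte-Carlo estimate of I(mu_tau) = E ||grad log p_tau(Y)||^2
    for mu = N(0, I_d). Here mu_tau = mu * N(0, tau I_d) = N(0, (1+tau) I_d),
    whose exact score is grad log p_tau(y) = -y / (1 + tau)."""
    rng = np.random.default_rng(seed)
    # Draw samples directly from mu_tau = N(0, (1 + tau) I_d).
    y = rng.standard_normal((n, d)) * np.sqrt(1.0 + tau)
    score = -y / (1.0 + tau)
    # Average squared score norm over the samples.
    return float(np.mean(np.sum(score**2, axis=1)))

def fisher_info_exact(tau, d=2):
    """Closed form for the Gaussian toy case: I(mu_tau) = d / (1 + tau)."""
    return d / (1.0 + tau)
```

Replacing the exact score with a trained score network's output turns this into the empirical FI estimator plotted in Figures 2 and 4; differentiating the resulting curve in $\tau$ (e.g., by finite differences) gives an FIR estimate.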

Theorems & Definitions (13)

  • Proposition 2.1: Hessian representation of FIR
  • Proposition 3.1: Fisher Information Bounds for Bi-Lipschitz Encoders
  • Definition 3.2: FIR Deviation
  • Theorem 3.4: Linear Stability
  • Theorem 3.5: Nonlinear Stability
  • Proposition 3.6
  • Remark 3.7: On the Magnitude of $\varepsilon$
  • Proof
  • Lemma B.1: Exact normal-direction contribution
  • Proof
  • ...and 3 more