Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

Mang Ning, Mingxiao Li, Le Zhang, Lanmiao Liu, Matthew B. Blaschko, Albert Ali Salah, Itir Onal Ertugrul

Abstract

In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on the CelebA and ImageNet datasets and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at https://github.com/forever208/SpectrumMatching.
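The abstract's central quantity, the power-law PSD, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `radial_psd`, `fit_power_law_exponent`, and `synth_power_law_image`, the bin count, and the synthetic test image are all assumptions made here for illustration. The sketch computes a radially averaged PSD of a 2D array and fits the exponent $\alpha$ in $S(\omega)\propto|\omega|^{-\alpha}$ by a log-log least-squares line.

```python
import numpy as np

def radial_psd(x, n_bins=16):
    """Radially averaged power spectral density of a 2D array (illustrative helper)."""
    h, w = x.shape
    power = np.abs(np.fft.fft2(x)) ** 2 / (h * w)
    fy = np.fft.fftfreq(h)  # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(w)  # horizontal frequencies in cycles per pixel
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2).ravel()
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    idx = np.digitize(radius, edges) - 1
    # Drop the DC term and the corners beyond the Nyquist radius 0.5.
    valid = (radius > 0) & (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[valid], weights=power.ravel()[valid], minlength=n_bins)
    counts = np.bincount(idx[valid], minlength=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, sums / np.maximum(counts, 1)

def fit_power_law_exponent(freqs, psd):
    """Estimate alpha in S(w) ~ |w|^(-alpha) from a log-log least-squares fit."""
    keep = psd > 0
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(psd[keep]), 1)
    return -slope

def synth_power_law_image(size, alpha, seed=0):
    """Synthetic stand-in for a natural image: white noise shaped to a power-law PSD."""
    rng = np.random.default_rng(seed)
    f = np.fft.fftfreq(size)
    radius = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)
    radius[0, 0] = 1.0  # avoid division by zero at the DC term
    spectrum = np.fft.fft2(rng.standard_normal((size, size))) * radius ** (-alpha / 2)
    return np.real(np.fft.ifft2(spectrum))

if __name__ == "__main__":
    image = synth_power_law_image(128, alpha=2.0)
    freqs, psd = radial_psd(image)
    print("estimated alpha:", fit_power_law_exponent(freqs, psd))
```

Under these assumptions, applying the same two functions to each latent channel and comparing the fitted exponent with that of the images gives a rough way to check whether a latent PSD is flatter than the image PSD.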

Paper Structure (31 sections, 3 theorems, 30 equations, 10 figures, 8 tables, 2 algorithms)


Key Result

Proposition 3.1

Let $\pmb{x}_0$ be a random natural image and $y_0(\omega)\triangleq \mathcal{F}(\pmb{x}_0)(\omega)$ be its Fourier coefficients with power spectral density $S(\omega)\triangleq \mathbb{E}\!\left[|y_0(\omega)|^2\right]=K|\omega|^{-\alpha}$. The diffusion forward process at timestep $t$: implies the diffusion in the Fourier domain with spectrally flat Gaussian noise $\eta(\omega)$. Let $\hat{\pmb
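The setup of the proposition can be made concrete with a short worked sketch. The forward-process equation is missing from the excerpt above, so the following assumes the standard variance-preserving form with cumulative schedule $\bar\alpha_t$ and a noise power $\sigma^2$ per frequency; these symbols are assumptions introduced here, not notation confirmed by the source.

```latex
% Assumed forward process (standard variance-preserving diffusion):
\pmb{x}_t = \sqrt{\bar\alpha_t}\,\pmb{x}_0 + \sqrt{1-\bar\alpha_t}\,\pmb{\epsilon},
\qquad \pmb{\epsilon}\sim\mathcal{N}(\pmb{0},\pmb{I}).

% The Fourier transform is linear, so the same relation holds per frequency,
% with spectrally flat noise \eta(\omega) = \mathcal{F}(\pmb{\epsilon})(\omega):
y_t(\omega) = \sqrt{\bar\alpha_t}\,y_0(\omega) + \sqrt{1-\bar\alpha_t}\,\eta(\omega),
\qquad \mathbb{E}\!\left[|\eta(\omega)|^2\right] = \sigma^2.

% Per-frequency signal-to-noise ratio under S(\omega) = K|\omega|^{-\alpha}:
\mathrm{SNR}_t(\omega)
  = \frac{\bar\alpha_t\,S(\omega)}{(1-\bar\alpha_t)\,\sigma^2}
  = \frac{\bar\alpha_t\,K\,|\omega|^{-\alpha}}{(1-\bar\alpha_t)\,\sigma^2}.
```

For $\alpha>0$ this ratio decreases with $|\omega|$ at every timestep, so under these assumptions high frequencies are masked by noise before low and mid frequencies are, which is consistent with the low- and mid-frequency bias described in the abstract.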

Figures (10)

  • Figure 1: Diagram of ESM and DSM in a typical VAE for latent diffusion.
  • Figure 2: The directional image (right), obtained by normalizing the magnitude at each pixel, maintains the spatial structure of the original image (left).
  • Figure 3: PCA visualization (top three principal components) of the latent space of different Autoencoders.
  • Figure 4: Spectrum distributions of the latents.
  • Figure 5: Spectrum distributions of the latents.
  • ...and 5 more figures
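The per-pixel magnitude normalization mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming that "magnitude" means the L2 norm of each pixel's channel vector; the function name `directional_map` and the `eps` guard are hypothetical and not taken from the paper.

```python
import numpy as np

def directional_map(x, eps=1e-8):
    """Scale each pixel's channel vector to unit L2 norm (assumed meaning of 'magnitude').

    x has shape (H, W, C). Only the direction of each pixel vector is kept,
    so the spatial layout is unchanged while per-pixel magnitude is discarded.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)  # eps guards all-zero pixels

if __name__ == "__main__":
    features = np.random.default_rng(0).standard_normal((4, 5, 3))
    directions = directional_map(features)
    print(np.linalg.norm(directions, axis=-1))  # per-pixel norms after normalization
```

Because each pixel is rescaled independently, multiplying the input by any positive constant leaves the output unchanged, which is one way to see that only directional information survives.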

Theorems & Definitions (6)

  • Proposition 3.1: Power-law PSD aligns diffusion training objective with perceptually dominant structure
  • Proposition 3.2: RMSC is equivalent to directional spectral energy
  • Remark A.1: Gaussian reference for a maximum-entropy upper bound
  • Lemma A.2: Maximum-entropy spectrum under a finite power budget implies flattening effect
  • Proof: Sketch of Proof
  • Proof