Improving the Diffusability of Autoencoders

Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

TL;DR

The paper tackles a gap in latent diffusion modeling by focusing on diffusability—the spectral alignment between autoencoder latents and RGB signals. It identifies that high-frequency content in latent spaces, especially with larger bottleneck channels, disrupts the coarse-to-fine diffusion process and harms generation quality. A simple scale equivariance regularization, implemented via downsampling consistency between latents and RGB, reduces these high-frequency components while preserving reconstruction, leading to sizable gains: up to about 19% lower FID on ImageNet-1K-$256^2$ and at least 44% lower FVD on Kinetics-700-$17\times256^2$, across multiple autoencoders and diffusion backbones. The results demonstrate that modest code changes and limited fine-tuning can substantially enhance diffusability and downstream generation quality, with a clear path for future extensions to adaptive spectral regularization and temporal scale-equivariance for video models.

Abstract

Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.
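
The regularization described above can be illustrated with a minimal NumPy sketch. It assumes that scale equivariance is measured as the mean squared error between the decoding of a downsampled latent and the correspondingly downsampled RGB image. The average-pooling operator, the `toy_decoder`, and all function names are hypothetical stand-ins chosen for illustration, not the authors' implementation.

```python
import numpy as np

def downsample2x(x):
    """Average-pool the last two axes by 2 (an assumed, simple downsampling operator)."""
    *lead, h, w = x.shape
    return x.reshape(*lead, h // 2, 2, w // 2, 2).mean(axis=(-3, -1))

def scale_equivariance_loss(decoder, latent, image, steps=1):
    """MSE between decoder(downsample(latent)) and downsample(image)."""
    z, x = latent, image
    for _ in range(steps):
        z, x = downsample2x(z), downsample2x(x)
    return float(np.mean((decoder(z) - x) ** 2))

def toy_decoder(z):
    """Hypothetical decoder: nearest-neighbour 2x upsampling of the latent."""
    return z.repeat(2, axis=-2).repeat(2, axis=-1)

# Toy usage: the "encoder" here is just average pooling, so the latent is a 2x smaller image.
rng = np.random.default_rng(0)
image = rng.normal(size=(3, 16, 16))
latent = downsample2x(image)
se_loss = scale_equivariance_loss(toy_decoder, latent, image)
print(f"scale equivariance loss: {se_loss:.4f}")
```

In an actual fine-tuning run, a term like this would presumably be added to the usual reconstruction objective with some weight. That weight and the choice of downsampling factors are hyperparameters not specified in this section.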

Paper Structure

This paper contains 23 sections, 10 equations, 25 figures, 10 tables.

Figures (25)

  • Figure 1: Convergence speed of DiT-XL/2 on top of vanilla FluxAE vs FluxAE fine-tuned for 10K steps with scale equivariance (SE) regularization on ImageNet-1K-$256^2$; and on top of CogVideoX-AE vs CogVideoX-AE with SE on Kinetics-700-$17\times256^2$. Our regularization improves the performance of image and video LDMs by refining the frequency profile of their autoencoders' latent spaces.
  • Figure 2: Latent frequency profiles of FluxAE autoencoders of varying bottleneck sizes, and also RGB (of the same $32^2$ spatial dimension). One can notice two things: 1) the latent space of an autoencoder exhibits a different power profile from RGB; and 2) high frequency amplitudes increase with the latent channel size.
  • Figure 3: Spectra of FluxAE autoencoders trained from scratch with different KL regularization strengths. KL regularization is a double-edged sword: it pushes the latent distribution closer to a standard Gaussian (the distribution the reverse diffusion process starts from), so that the LDM has less work to do (LSGM), but the random noise it adds also introduces high-frequency components into the latents, which the LDM is then forced to model as well. The influence of noise addition on the frequency profile is shown in a separate figure (label fig:rgb-noise).
  • Figure 4: DCT spectrum of the FluxAE latents with and without scale equivariance (SE) regularization. Fine-tuning AEs with SE brings the spectrum closer to that of the RGB domain, and more so as the regularization strength increases.
  • Figure 5: RGB and autoencoder reconstructions with progressively erased DCT high-frequency components. RGB images (top) degrade only minimally as a higher percentage of their DCT spectrum is removed, whereas the FluxAE reconstructions (middle) degrade quickly once high-frequency components are removed from the latents. A high-frequency cutoff regularization forces the autoencoder to rely more on the low-frequency region of the latents, leading to better compression and resilience to high-frequency error accumulation in diffusion models.
  • ...and 20 more figures
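
The frequency profiles and high-frequency erasure mentioned in the captions above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: it uses an orthonormal 2D DCT-II, groups coefficients into bands by the diagonal index u + v, and erases bands above a cutoff. The band definition, the `keep_fraction` parameter, and the function names are hypothetical; the paper's exact procedure may differ.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; rows index frequency."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def dct2(x):
    """2D DCT over the last two axes (leading axes such as channels are broadcast)."""
    return dct_matrix(x.shape[-2]) @ x @ dct_matrix(x.shape[-1]).T

def idct2(c):
    """Inverse of dct2; valid because the basis matrices are orthonormal."""
    return dct_matrix(c.shape[-2]).T @ c @ dct_matrix(c.shape[-1])

def band_index(h, w):
    """Assumed frequency band of each coefficient: the diagonal index u + v."""
    return np.arange(h)[:, None] + np.arange(w)[None, :]

def frequency_profile(x):
    """Mean DCT power per band, averaged over any leading (channel) axes."""
    h, w = x.shape[-2:]
    power = (dct2(x) ** 2).reshape(-1, h * w).mean(axis=0)
    band = band_index(h, w).ravel()
    sums = np.bincount(band, weights=power, minlength=h + w - 1)
    return sums / np.bincount(band, minlength=h + w - 1)

def erase_high_frequencies(x, keep_fraction):
    """Zero all bands above keep_fraction of the maximum band index, then invert."""
    h, w = x.shape[-2:]
    mask = band_index(h, w) <= keep_fraction * (h + w - 2)
    return idct2(np.where(mask, dct2(x), 0.0))

# Toy usage on random data standing in for a 16-channel 32x32 latent.
latent = np.random.default_rng(0).normal(size=(16, 32, 32))
profile = frequency_profile(latent)
low_pass = erase_high_frequencies(latent, keep_fraction=0.5)
print(profile.shape, low_pass.shape)
```

Applying `frequency_profile` to both an RGB image and a latent of the same spatial size gives curves that can be compared in the spirit of Figures 2 and 4. Decoding the output of `erase_high_frequencies` would mimic the erasure experiment of Figure 5.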