Geometric Autoencoder for Diffusion Models

Hangyu Liu, Jianyong Wang, Yutao Sun

TL;DR

We propose Geometric Autoencoder (GAE), a principled framework that systematically addresses the challenges of latent diffusion modeling and establishes a superior equilibrium among compression, semantic depth, and robust reconstruction stability.

Abstract

Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/sii-research/GAE.
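
The abstract names two latent-side mechanisms without giving formulas: latent normalization in place of the KL term, and dynamic noise sampling for robust reconstruction. The following is a minimal sketch of how such a pair could look, not the authors' implementation. It assumes that normalization means per-vector standardization without learned parameters, and that "dynamic" means drawing the noise level uniformly per sample; the names `normalize_latent`, `add_dynamic_noise`, and `sigma_max` are hypothetical.

```python
import math
import random

def normalize_latent(z, eps=1e-6):
    """Standardize one latent vector to zero mean and unit variance.

    Assumption: no learned scale/shift; the paper may normalize differently.
    """
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in z]

def add_dynamic_noise(z, sigma_max, rng):
    """Draw sigma ~ U(0, sigma_max), then return (z + sigma * eps, sigma).

    Assumption: uniform sampling of the noise level; the actual schedule
    used by GAE is not specified in the abstract.
    """
    sigma = rng.uniform(0.0, sigma_max)
    noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in z]
    return noisy, sigma

rng = random.Random(0)
z = [rng.gauss(3.0, 5.0) for _ in range(32)]  # stand-in for a d = 32 latent
z_norm = normalize_latent(z)                  # bounded scale without a KL penalty
z_noisy, sigma = add_dynamic_noise(z_norm, sigma_max=0.8, rng=rng)
# In training, the decoder would reconstruct the image from z_noisy.
```

The point of the sketch is the ordering: the latent is brought to a fixed scale first, so a given noise level means the same thing for every sample the decoder sees.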

Paper Structure

This paper contains 57 sections, 5 equations, 6 figures, 14 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Overview of GAE performance. Left: GAE establishes a superior Pareto curve between linear probing accuracy and latent dimension. Right: GAE with a 32-dimensional latent significantly accelerates convergence and delivers superior generation results at both 80 and 800 training epochs.
  • Figure 2: Overview of the GAE architecture. The input image is processed through a pixel-level branch ($E_p$, $A_p$) and a frozen semantic branch (VFM, $E_{sp}$). A Semantic Preservation loss $L_{sp}$ aligns the latent mean with the features from the decoupled Semantic Teacher.
  • Figure 3: Illustration of the three latent alignment paradigms: The term $L_{sp}$ aligns the AE representations with those of the VFM. Pre Alignment aligns high-dimensional encoder features directly; Post Alignment projects AE latents back to a high-dimensional space for supervision; Latent Alignment operates within the compressed latent bottleneck via a projection of VFM features.
  • Figure 4: Decoder stability against latent noise injection. We evaluate rFID by adding varying levels of Gaussian noise to the latent representations before decoding. The results demonstrate that models trained with higher $\sigma$ exhibit superior tolerance to latent distribution shifts, ensuring stable performance during generative sampling.
  • Figure 5: Qualitative results. Samples are generated by GAE ($d=32$) after 800 epochs of training. A Classifier-Free Guidance (CFG) scale of $w=3.3$ is utilized to enhance visual fidelity.
  • ...and 1 more figure
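
The three alignment paradigms in the Figure 3 caption differ mainly in where the comparison with the VFM feature happens. The sketch below illustrates only that placement. It is not the paper's loss: mean squared error stands in for whatever distance $L_{sp}$ actually uses, the random linear maps stand in for learned projections, and the widths `D` and `d` and all variable names are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rand_matrix(rows, cols, rng):
    """Random linear map, a placeholder for a learned projection."""
    return [[rng.gauss(0.0, cols ** -0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
D, d = 16, 4  # hypothetical VFM feature width and latent width

f = [rng.gauss(0.0, 1.0) for _ in range(D)]  # frozen VFM feature (teacher)
h = [rng.gauss(0.0, 1.0) for _ in range(D)]  # high-dimensional AE encoder feature
z = matvec(rand_matrix(d, D, rng), h)        # AE latent after the bottleneck

W_up = rand_matrix(D, d, rng)       # Post Alignment: lift the latent back to width D
W_teacher = rand_matrix(d, D, rng)  # Latent Alignment: project the VFM feature to width d

loss_pre = mse(h, f)                        # compare in high-dimensional space, before the bottleneck
loss_post = mse(matvec(W_up, z), f)         # compare in high-dimensional space, after the bottleneck
loss_latent = mse(z, matvec(W_teacher, f))  # compare inside the d-dimensional bottleneck
```

Only the last variant places the supervision target in the same low-dimensional space as the latent itself, which matches the abstract's description of a low-dimensional semantic supervision target built from VFMs.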