EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

TL;DR

EQ-VAE tackles the problem that autoencoder latent spaces are not equivariant to spatial transformations such as scaling and rotation, which increases the burden on downstream latent generative models. It introduces an equivariance-regularized objective that can be applied by fine-tuning pre-trained autoencoders, using an implicit regularization that aligns reconstructions from transformed latents with the correspondingly transformed inputs. The approach improves generative metrics across multiple models (DiT, SiT, REPA, MaskGIT) and accelerates training by up to $\times 7$ (on DiT-XL/2) while preserving reconstruction quality. This plug-and-play method is compatible with both continuous and discrete autoencoders, offering practical benefits for a wide range of latent diffusion and masked generation systems.

Abstract

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By fine-tuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a $\times 7$ speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
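
To make the notion of equivariance concrete, the property can be written down as a worked formula. This is only a sketch: $\mathcal{E}$ (encoder), $\mathcal{D}$ (decoder) and $\tau$ (a spatial transformation such as scaling or rotation) follow the notation of the figure captions, while $\mathcal{L}_{\text{eq}}$ and the distance $d$ are placeholder symbols assumed here for illustration and are not claimed to be the paper's exact objective.

```latex
% Equivariance: transforming the input and then encoding gives the same
% result as encoding and then transforming the latent.
\mathcal{E}(\tau \circ \mathbf{x}) = \tau \circ \mathcal{E}(\mathbf{x})

% One reconstruction-style way to encourage this (assumed form, with d a
% generic image distance): decode the transformed latent and compare it
% with the transformed input.
\mathcal{L}_{\text{eq}}(\mathbf{x}, \tau)
  = d\bigl(\tau \circ \mathbf{x},\; \mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x}))\bigr)
```

If the encoder is equivariant and the autoencoder reconstructs $\tau \circ \mathbf{x}$ well, then $\mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x})) = \mathcal{D}(\mathcal{E}(\tau \circ \mathbf{x})) \approx \tau \circ \mathbf{x}$, so a term of this shape stays small.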

Paper Structure

This paper contains 42 sections, 7 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Latent Space Structure (Left). Top three principal components of SD-VAE and SDXL-VAE, with and without EQ-VAE, demonstrating visually that our regularization produces smoother latent representations without compromising reconstruction (see \ref{tab:comp_auto}). Accelerated Training (Right). Training curves (without classifier-free guidance) for DiT-XL/2 and REPA (w/ SiT-XL/2), showing that our EQ-VAE accelerates convergence by $\times 7$ and $\times 4$, respectively.
  • Figure 2: Latent Space Equivariance. Reconstructed images using SD-VAE (Rombach et al., 2022) and our EQ-VAE when applying a scaling transformation $\tau$, with factor $s=0.5$, to the input images, $\mathcal{D}(\mathcal{E}(\tau \circ \mathbf{x}))$, versus directly to the latent representations, $\mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x}))$. Our approach preserves reconstruction quality under latent transformations, whereas SD-VAE exhibits significant degradation. See \ref{fig:qualitative-equivariance-appendix} for additional examples.
  • Figure 3: Enhanced Reconstruction under Latent Transformations. Reconstruction rFID measured between $\tau \circ \mathbf{x}$ and $\mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x}))$ for various spatial transformations. We consider scaling transforms with factors $s = 0.75, 0.50, 0.25$ and also measure the average rFID over rotation angles $\theta = \frac{\pi}{2}, \pi, \frac{3\pi}{2}$. Results are shown for SD-VAE (Rombach et al., 2022) and SDXL-VAE (Podell et al., 2024), with and without EQ-VAE. Our approach significantly reduces rFID compared to baselines, improving image fidelity under latent transformations. For readability, we show $\lfloor \text{rFID} \rfloor$.
  • Figure 4: EQ-VAE accelerates generative modeling. We compare results from two DiT-XL/2 models at 50K, 100K, and 400K iterations, one trained with SD-VAE-FT-EMA (top) and one with EQ-VAE (bottom). The same noise and number of sampling steps are used for both models, without classifier-free guidance. Our approach delivers faster improvements in image quality, demonstrating accelerated convergence.
  • Figure 5: Rapid Improvement via EQ-VAE Fine-tuning. Even a single epoch of EQ-VAE fine-tuning significantly improves generative modeling performance with DiT-B/2, reducing gFID from 43.5 to 36.7.
  • ...and 3 more figures
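
The comparison described in the captions of Figures 2 and 3 can be illustrated on a toy example. The sketch below is not the paper's code: the "autoencoder" is a hypothetical 2x2 average-pool encoder with a nearest-neighbour decoder, the optional `bias` term is an assumed device for baking row position into the latent, mean squared error stands in for rFID, and only rotations by multiples of $\frac{\pi}{2}$ are used. It shows that two autoencoders with identical reconstruction error can behave very differently when $\tau$ is applied to the latent.

```python
def rot90(grid, k=1):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def avg_pool2(image):
    """Downsample by 2 with non-overlapping 2x2 averaging."""
    return [[(image[i][j] + image[i][j + 1]
              + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
             for j in range(0, len(image[0]), 2)]
            for i in range(0, len(image), 2)]


def upsample2(latent):
    """Upsample by 2 with nearest-neighbour replication."""
    out = []
    for row in latent:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def make_autoencoder(bias):
    """Toy encoder/decoder pair (hypothetical).

    With bias != 0 the encoder adds the row index into the latent and the
    decoder removes it again, so reconstruction is unaffected but the
    latent is no longer equivariant to rotations.
    """
    def enc(image):
        return [[v + bias * i for v in row]
                for i, row in enumerate(avg_pool2(image))]

    def dec(latent):
        return upsample2([[v - bias * i for v in row]
                          for i, row in enumerate(latent)])

    return enc, dec


def path_errors(x, k, enc, dec):
    """Errors of both decoding paths against tau(x), tau = rotation by k * 90 degrees."""
    target = rot90(x, k)
    input_path = mse(target, dec(enc(rot90(x, k))))   # D(E(tau(x)))
    latent_path = mse(target, dec(rot90(enc(x), k)))  # D(tau(E(x)))
    return input_path, latent_path


def mean_latent_path_error(x, enc, dec, ks=(1, 2, 3)):
    """Average latent-path error over rotations by pi/2, pi and 3*pi/2."""
    return sum(path_errors(x, k, enc, dec)[1] for k in ks) / len(ks)


x = upsample2([[1.0, 2.0], [3.0, 4.0]])  # 4x4 toy "image"
for name, bias in [("equivariant", 0.0), ("position-coded", 0.5)]:
    enc, dec = make_autoencoder(bias)
    print(name, mse(x, dec(enc(x))), mean_latent_path_error(x, enc, dec))
```

Both toy autoencoders reconstruct `x` exactly, yet only the one without the position bias keeps a zero error when the rotation is applied in latent space. This mirrors, in a deliberately simplified setting, the gap between $\mathcal{D}(\mathcal{E}(\tau \circ \mathbf{x}))$ and $\mathcal{D}(\tau \circ \mathcal{E}(\mathbf{x}))$ that the figures report for real autoencoders.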