Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang, Xuebo Wang, Xin Tao, Qi Fan

TL;DR

The paper addresses the reconstruction–generation trade-off in high-dimensional ViT-based VAEs used with diffusion models, identifying input-independent high-frequency noise as the root cause of optimization difficulty. It introduces Denoising-VAE with Multi-Level Spectral Regularization to denoise latents while preserving perceptual reconstruction, and Frequency-Domain Diffusion Alignment to guide diffusion training. On ImageNet 256×256, the approach achieves $rFID=0.28$, $PSNR=27.26$, and $gFID=1.82$, with a 32-channel VAE enabling nearly $2\times$ faster diffusion convergence and up to $5.75\times$ GFLOPs reduction versus SD-VAE. These results demonstrate that spectral denoising can improve training stability and generation quality without external VFMs, offering a scalable, VFM-free alternative for high-resolution latent diffusion.

Abstract

Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving reconstruction quality. The resulting Denoising-VAE, a ViT-based autoencoder that does not rely on VFMs, produces cleaner, lower-noise latents, leading to improved generative quality and faster optimization convergence. We further introduce a spectral alignment strategy to facilitate the optimization of Denoising-VAE-based generative models. Our complete method enables diffusion models to converge approximately 2$\times$ faster than with SD-VAE, while achieving state-of-the-art reconstruction quality (rFID = 0.28, PSNR = 27.26) and competitive generation performance (gFID = 1.82) on the ImageNet 256$\times$256 benchmark.
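To make the notion of "redundant high-frequency components" in a latent more concrete, the following is a minimal sketch of one way to quantify them: the fraction of a latent's spectral energy that lies above a radial frequency cutoff. This is an illustrative diagnostic, not the paper's regularization or its evaluation metric. The function name `high_freq_energy_ratio`, the `(C, H, W)` layout, and the default cutoff of 0.25 cycles per sample are all assumptions made for this example.

```python
import numpy as np


def high_freq_energy_ratio(latent, cutoff=0.25):
    """Fraction of spectral energy above a radial frequency cutoff.

    Hypothetical helper for illustration only.
    latent: array of shape (C, H, W).
    cutoff: radial frequency in cycles per sample (Nyquist is 0.5 per axis).
    """
    latent = np.asarray(latent, dtype=np.float64)
    _, h, w = latent.shape

    # Power spectrum of each channel over the two spatial axes.
    power = np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2

    # Radial frequency of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    total = power.sum()
    if total == 0.0:
        return 0.0

    # Sum the energy of all bins whose radius exceeds the cutoff.
    high = radius > cutoff
    return float(power[:, high].sum() / total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A smooth synthetic "latent": one low-frequency sinusoid per channel.
    yy, xx = np.meshgrid(np.arange(16), np.arange(16), indexing="ij")
    smooth = np.stack([np.sin(2 * np.pi * (yy + xx) / 16.0)] * 4)
    # The same latent with additive white noise, which spreads energy
    # across all frequencies, including the high ones.
    noisy = smooth + 0.5 * rng.standard_normal(smooth.shape)
    print("smooth:", high_freq_energy_ratio(smooth))
    print("noisy: ", high_freq_energy_ratio(noisy))
```

Under these assumptions, a latent with more high-frequency noise yields a larger ratio, so the value can serve as a rough indicator when comparing tokenizers of different latent dimensionality.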

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Frequency-domain analysis of ViT-based VAE latent spaces and comparison with convolutional baselines. (a) Encoding a uniform-color image with a ViT-based tokenizer yields structured high-frequency artifacts in the latent space, revealing input-independent noise injected during tokenization. (b) As latent dimensionality increases, PCA projections of ViT-based latents become increasingly noisy and spatially unstable, indicating the amplification of undesired high-frequency variation. (c) In contrast, conventional VAEs show strong dependence on high-frequency latent signals, where reconstruction fidelity is tightly coupled to noise patterns.
  • Figure 2: Per-image latent denoising with Denoising-VAE. After Spectral Regularization, the latents reveal smoother, more coherent structure.
  • Figure 3: FID comparisons with vanilla SiT across different VAE settings on ImageNet $256\times256$ without CFG. Introducing denoising and frequency-domain alignment consistently improves convergence speed and generation quality across all latent dimensionalities.
  • Figure 4: Training losses of tokenizers under different settings.
  • Figure 5: Visualization results. We visualize our latent diffusion system, which pairs the proposed Denoising-VAE with SiT-XL, trained on ImageNet at $256\times256$ resolution and sampled using classifier-free guidance with $\omega = 1.8$.