
VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, Denis Dimitrov

TL;DR

VIVAT tackles persistent artifacts in KL-VAE training by mapping five common issues to concrete, easily implementable remedies that avoid radical architectural changes. By adjusting loss weights, padding schemes, and introducing Spatially Conditional Normalization, the approach yields superior reconstruction (PSNR/SSIM) and improved text-to-image quality (CLIP) within latent-diffusion pipelines. The work presents a practical, scalable path to robust VAE training, demonstrating both artifact mitigation and compatibility with high-level generation tasks. Overall, VIVAT highlights the continued effectiveness of classical autoencoding frameworks when complemented with targeted optimization strategies.

Abstract

Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts (color shift, grid patterns, blur, corner artifacts, and droplet artifacts) and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.
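
For readers unfamiliar with the reconstruction metrics named in the abstract, the following is a minimal sketch of PSNR using its standard definition, 10 · log10(MAX² / MSE). It is not the authors' evaluation code: the function name `psnr`, the flat-list pixel representation, and the default `max_value=255.0` are assumptions made only for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two flat pixel sequences.

    Hypothetical helper for illustration; max_value is the largest possible
    pixel value (255 for 8-bit images, 1.0 for images scaled to [0, 1]).
    """
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)

    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR indicates a reconstruction closer to the original. SSIM, the other metric mentioned, compares local structure rather than per-pixel error and is not covered by this sketch.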

Paper Structure

This paper contains 42 sections, 7 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: VAE architecture.
  • Figure 2: Reconstruction artifacts resulting from the unimproved VAE model.
  • Figure 3: Droplet artifact formation. Activation norms grow on several consecutive Decoder layers.
  • Figure 4: The results of our methods for addressing VAE reconstruction artifacts. Our proposed approach effectively eliminates many problems, ensuring high-quality reconstructions.
  • Figure 5: The results of image reconstruction using different models. A zoom-in shows that our approach, based on simple heuristics, achieves results comparable to state-of-the-art models. Mitigating artifacts makes it possible to reconstruct small details and text with higher quality.
  • ...and 5 more figures