H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov

TL;DR

The paper tackles the computational bottleneck of video diffusion by designing H3AE, a high-compression, high-speed video VAE. It integrates a carefully crafted micro- and macro-architecture with omni-objective training and a latent consistency loss to achieve real-time decoding on mobile and strong reconstruction quality at large compression ratios. Importantly, it demonstrates that a single VAE can support both plain T2V and I2V generation, and validates diffusion usability by training a DiT on the VAE's latent space, achieving faster inference and high-quality video synthesis. The findings challenge reliance on traditional auxiliary losses (LPIPS, GAN, DWT) and offer practical pathways for accessible, efficient video generation on consumer devices. Limitations include hardware/data access and potential misuse, which the authors acknowledge and contextualize.

Abstract

The autoencoder (AE) is key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the potential of the AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient, high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective that unifies the design of the plain autoencoder and the image-conditioned I2V VAE, achieving multifunctionality in a single VAE network with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. The latent consistency loss outperforms prior auxiliary losses, including LPIPS, GAN, and DWT, in terms of both quality improvement and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior art on reconstruction metrics by a large margin. Finally, we validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation.
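To make "reducing the denoising resolution" concrete, the following is a minimal arithmetic sketch, not the authors' code. It counts the latent positions a diffusion model would operate on for the $8\times32\times32$ and $8\times16\times16$ compression factors quoted in the figure captions. The clip size (16 frames at 512×512) is a hypothetical example, and the sketch assumes plain division along each axis; the real model may treat the first frame differently.

```python
def latent_positions(frames, height, width, ft, fh, fw):
    """Count spatio-temporal latent positions after downsampling.

    ft, fh, fw are the temporal, height, and width compression factors.
    Assumes each input dimension divides evenly (an illustrative simplification).
    """
    if frames % ft or height % fh or width % fw:
        raise ValueError("input dimensions must be divisible by the compression factors")
    return (frames // ft) * (height // fh) * (width // fw)


def pixels_per_latent_position(ft, fh, fw):
    """Number of input pixel positions summarized by one latent position."""
    return ft * fh * fw


# Hypothetical clip: 16 frames at 512x512.
clip = (16, 512, 512)

high = latent_positions(*clip, 8, 32, 32)   # 8x32x32 factors
lower = latent_positions(*clip, 8, 16, 16)  # 8x16x16 factors

print(high, lower, lower // high)
print(pixels_per_latent_position(8, 32, 32))
```

Under these assumptions, doubling the spatial factor from 16 to 32 cuts the number of latent positions by a factor of four, which is what shrinks the sequence the denoiser has to process.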

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Compression ratio is in $\log_2$ scale. H3AE achieves a better compression–PSNR trade-off and is faster and more parameter-efficient. Refer to \ref{table:compare sotas} for more benchmarks.
  • Figure 2: Overview of the H3AE architecture, omni-objective training, and the latent consistency loss. When computing $z^\prime$ in \ref{equ:cycle loss}, the encoder weights remain frozen. For omni-objective training, we randomly pass the hierarchical features of the first frame from the encoder to the decoder, using addition by default for feature fusion. As shown on the right, a block-shaped causal mask is applied to the 3D Transformer to enforce causal attention, ensuring proper temporal dependencies in the generated representations.
  • Figure 3: AE Qualitative Results. Reconstructions from our H3AE ($8\times32\times32$) and other high-compression autoencoders: Cosmos-Tokenizer \cite{CosmosTokenizer} ($8\times16\times16$) and LTX-VAE \cite{LTX-video} ($8\times32\times32$). We show zoomed-in results to highlight the differences in fidelity and quality. Our method preserves more high-frequency detail. GT refers to the ground-truth video.
  • Figure 4: T2V Qualitative Results. Examples of videos generated by a 2B DiT denoiser, trained on the latent space of our $8\times32\times32$ H3AE.
  • Figure 5: Quality comparison of H3AE reconstructions between the plain-T2V VAE and I2V VAE settings. The results show that the I2V VAE delivers better high-frequency detail.
  • ...and 1 more figure
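The block-shaped causal mask mentioned in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes each block corresponds to one latent frame and that every frame holds the same number of tokens, and the function names are hypothetical. A token may attend to all tokens in its own frame and in earlier frames, but not to later frames.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask; True means 'query i may attend to key j'.

    Tokens are assumed to be ordered frame by frame, so integer division by
    tokens_per_frame recovers the frame index of each token.
    """
    n = num_frames * tokens_per_frame
    return [
        [(j // tokens_per_frame) <= (i // tokens_per_frame) for j in range(n)]
        for i in range(n)
    ]


def render(mask):
    """Render the mask as rows of 1 (allowed) and 0 (blocked)."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(render(mask))
# 110000
# 110000
# 111100
# 111100
# 111111
# 111111
```

Unlike a token-level causal mask, which is strictly lower triangular, this one keeps attention bidirectional within a frame, so the allowed region forms a staircase of square blocks.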