H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
TL;DR
The paper tackles the computational bottleneck of video diffusion by designing H3AE, a high-compression, high-speed video autoencoder. It combines a carefully designed micro- and macro-architecture with an omni-training objective and a latent consistency loss to achieve real-time decoding on mobile devices and strong reconstruction quality at large compression ratios. It shows that a single network can serve as both a plain autoencoder and an image-conditioned I2V VAE, and it validates usability for diffusion by training a DiT on the autoencoder's latent space, demonstrating fast, high-quality text-to-video generation. The authors report that the latent consistency loss outperforms the commonly used auxiliary losses (LPIPS, GAN, DWT) in both quality improvement and simplicity, which offers a practical path toward efficient video generation on consumer devices. The authors acknowledge limitations, including hardware and data access constraints and the potential for misuse.
Abstract
The autoencoder (AE) is key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of the AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient, high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of the plain autoencoder and the image-conditioned I2V VAE, achieving multifunctionality in a single VAE network with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. The latent consistency loss outperforms prior auxiliary losses, including LPIPS, GAN, and DWT, in terms of both quality improvement and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior art in reconstruction metrics by a large margin. Finally, we validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation.
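To make the link between compression ratio and denoising cost concrete, the following is a minimal sketch of how temporal and spatial downsampling factors shrink the latent grid a diffusion model has to process. The clip size and both sets of factors are hypothetical values chosen for illustration, not H3AE's reported configuration, and the helper names `latent_grid` and `num_positions` are invented here. The sketch also assumes the input dimensions divide evenly by the factors.

```python
from math import prod


def latent_grid(frames, height, width, t_factor, s_factor):
    """Return the latent grid (T, H, W) after temporal/spatial downsampling."""
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("input dimensions must be divisible by the factors")
    return (frames // t_factor, height // s_factor, width // s_factor)


def num_positions(grid):
    """Number of latent positions the denoiser must process."""
    return prod(grid)


# Hypothetical clip and compression factors, for illustration only.
clip = (16, 512, 512)  # frames, height, width
low = latent_grid(*clip, t_factor=4, s_factor=8)
high = latent_grid(*clip, t_factor=8, s_factor=32)

print(low, num_positions(low))    # (4, 64, 64) 16384
print(high, num_positions(high))  # (2, 16, 16) 512
print(num_positions(low) / num_positions(high))  # 32.0
```

Under these assumed factors the higher-compression setting leaves 32 times fewer latent positions. For a denoiser that uses full self-attention, whose cost grows quadratically with sequence length, the saving in attention compute would be larger still.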
