Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation
Alexey Buzovkin, Evgeny Shilov
TL;DR
This work tackles the decoder bottleneck in latent-diffusion pipelines for image and video generation by replacing the large VAE decoder with lightweight decoders based on Vision Transformer and Taming Transformer architectures. The authors design two primary decoders, TAE-192 and EfficientViT, plus a temporal variant for video, and train them on large-scale datasets. The resulting decoders are substantially faster (up to about two times faster at high resolutions, and up to twenty times faster for video decoding in some configurations) at the cost of a moderate reduction in perceptual quality. The approach is validated on COCO2017 and UCF-101 using standard image and video metrics (SSIM, PSNR, FID), and the video evaluation is augmented with VideoMAE V2 embeddings to capture temporal fidelity. The reported trade-off between speed and quality is favorable for large-scale inference. The results suggest meaningful practical gains for real-time or large-scale diffusion tasks; future work points to dual masking, temporal modeling, and integration with larger vision models to further improve scalability and quality.
Abstract
We investigate methods to reduce inference time and memory footprint in Stable Diffusion models by introducing lightweight decoders for both image and video synthesis. Traditional latent diffusion pipelines rely on large Variational Autoencoder (VAE) decoders that can slow down generation and consume considerable GPU memory. We propose custom-trained decoders using lightweight Vision Transformer and Taming Transformer architectures. Experiments show up to 15% overall speed-ups for image generation on COCO2017 and up to 20 times faster decoding when the decoder sub-module is measured in isolation, with additional gains on UCF-101 for video tasks. Memory requirements are moderately reduced. Although there is a small drop in perceptual quality compared to the default decoder, the improvements in speed and scalability are crucial for large-scale inference scenarios such as generating 100K images. Our work is further contextualized by advances in efficient video generation, including dual masking strategies, illustrating a broader effort to improve the scalability and efficiency of generative models.
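The gap between the 20 times decoder speed-up and the 15% end-to-end speed-up follows from Amdahl's law: only the share of runtime spent in the decoder gets faster. The sketch below is illustrative arithmetic, not a measurement from the paper. It assumes that "15% overall speed-up" means a 1.15x end-to-end speed-up and that both "up to" figures come from the same configuration, which the abstract does not state. The function names are hypothetical.

```python
def overall_speedup(decoder_fraction: float, decoder_speedup: float) -> float:
    """Amdahl's law: end-to-end speed-up when only the decoder is accelerated.

    decoder_fraction: share of baseline runtime spent in the decoder (0..1).
    decoder_speedup:  how many times faster the new decoder is (> 0).
    """
    if not 0.0 <= decoder_fraction <= 1.0:
        raise ValueError("decoder_fraction must lie in [0, 1]")
    if decoder_speedup <= 0.0:
        raise ValueError("decoder_speedup must be positive")
    return 1.0 / ((1.0 - decoder_fraction) + decoder_fraction / decoder_speedup)


def implied_decoder_fraction(overall: float, decoder_speedup: float) -> float:
    """Invert Amdahl's law: the decoder's runtime share implied by two speed-ups."""
    if overall <= 0.0 or decoder_speedup <= 0.0 or decoder_speedup == 1.0:
        raise ValueError("speed-ups must be positive and decoder_speedup != 1")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / decoder_speedup)


if __name__ == "__main__":
    # Assumed reading of the abstract: 1.15x overall, 20x in the decoder.
    share = implied_decoder_fraction(overall=1.15, decoder_speedup=20.0)
    print(f"implied decoder share of baseline runtime: {share:.1%}")
    # Upper bound on the overall gain if the decoder cost nothing at all.
    print(f"ceiling with a free decoder: {1.0 / (1.0 - share):.3f}x")
```

Under these assumptions the decoder would account for roughly 14% of baseline runtime, so even an infinitely fast decoder could not push the end-to-end gain much beyond 16%. The rest of the time is spent in the denoising steps, which the decoder swap does not touch.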
