OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan

TL;DR

OD-VAE introduces omni-dimensional video compression for latent video diffusion models by applying a 3D-causal-CNN to simultaneously compress temporal and spatial information. It presents four variants to balance reconstruction quality and compression speed, plus a tail initialization and temporal tiling strategy to speed training and enable long-video processing. Empirical results show competitive reconstruction performance against SD-VAE/SVD-VAE and superior LVDM efficiency, with notable memory and speed gains. These contributions enhance the practicality of LVDMs for high-resolution, long-duration video generation under limited resources.

Abstract

A Variational Autoencoder (VAE), which compresses videos into latent representations, is a crucial upstream component of Latent Video Diffusion Models (LVDMs). At equal reconstruction quality, the more aggressively the VAE compresses videos, the more efficient the LVDM becomes. However, most LVDMs use a 2D image VAE, which compresses videos only along the spatial dimensions and leaves the temporal dimension uncompressed. How to add temporal compression to a VAE so that it yields more compact latent representations while still reconstructing accurately has seldom been explored. To fill this gap, we propose an omni-dimensional compression VAE, named OD-VAE, which compresses videos both temporally and spatially. Although this stronger compression makes video reconstruction considerably harder, OD-VAE still achieves high reconstruction accuracy thanks to its careful design. To obtain a better trade-off between video reconstruction quality and compression speed, we introduce and analyze four variants of OD-VAE. In addition, we design a novel tail initialization to train OD-VAE more efficiently and propose a novel inference strategy that enables OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of the proposed methods.
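To make the efficiency argument concrete, the following back-of-the-envelope sketch counts latent positions for a spatial-only VAE and for one that also compresses time. The factors used here (8x spatial, 4x temporal), the convention that the first frame is encoded on its own, and the names `latent_shape` and `latent_positions` are illustrative assumptions; the abstract does not state them.

```python
def latent_shape(frames, height, width, spatial_factor=8, temporal_factor=1):
    """Return (latent_frames, latent_height, latent_width).

    Assumed convention: with temporal compression, the first frame is kept
    on its own and the remaining frames are grouped by temporal_factor.
    """
    if temporal_factor == 1:
        latent_frames = frames
    else:
        latent_frames = 1 + (frames - 1) // temporal_factor
    return latent_frames, height // spatial_factor, width // spatial_factor


def latent_positions(shape):
    """Number of latent positions the diffusion model has to process."""
    t, h, w = shape
    return t * h * w


# Hypothetical clip: 17 frames at 256x256.
spatial_only = latent_shape(17, 256, 256, spatial_factor=8, temporal_factor=1)
omni = latent_shape(17, 256, 256, spatial_factor=8, temporal_factor=4)
ratio = latent_positions(spatial_only) / latent_positions(omni)

print(spatial_only, omni, ratio)
```

Under these assumed factors the diffusion model sees several times fewer latent positions per clip, which is the source of the memory and speed gains the abstract refers to.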

Paper Structure (13 sections, 5 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of OD-VAE. It adopts a 3D-causal-CNN architecture to compress videos temporally and spatially into concise latent representations and can reconstruct them accurately, which greatly improves the efficiency of LVDMs.
  • Figure 2: Four variants of OD-VAE. Variant 1: inflating all the 2D convolutions in SD-VAE to 3D convolutions. Variant 2: replacing half of the 3D convolutions in variant 1 with 2D convolutions. Variant 3: replacing the 3D convolutions in the outer blocks of variant 1's encoder and decoder with 2D convolutions. Variant 4: replacing the 3D convolutions in the outer blocks of variant 1's encoder with 2D convolutions.
  • Figure 3: Video generation results of LVDMs with different VAEs on the SkyTimelapse dataset. With OD-VAE, the LVDM generates more realistic, higher-quality videos.
  • Figure 4: (a) and (b) show the PSNR and LPIPS of the four variants on the WebVid-10M validation set; (c) shows their FVD on the UCF101 dataset. (d) and (e) show the PSNR and LPIPS of the three initialization methods on the WebVid-10M validation set; (f) shows their FVD on the UCF101 dataset.
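
The captions describe inflating 2D convolutions into 3D ones, and the abstract mentions a tail initialization, but neither spells out the mechanics here. The sketch below shows one plausible reading, reduced to the temporal axis: the pretrained per-frame weight is placed at the last (tail) position of the temporal kernel and the other positions are zero, so a causal convolution initially reproduces the per-frame result. Scalars stand in for feature maps, padding by repeating the first frame is an assumption, and the names `causal_temporal_conv` and `tail_init` are hypothetical.

```python
def causal_temporal_conv(frames, kernel):
    """Stride-1 convolution along time that never looks at future frames.

    frames: list of per-frame values (scalars stand in for feature maps).
    kernel: list of temporal weights; kernel[-1] multiplies the current frame.
    """
    if not frames:
        return []
    pad = len(kernel) - 1
    # Pad only at the front (assumed: repeat the first frame), so output t
    # depends on frames t - pad .. t and on nothing later.
    padded = [frames[0]] * pad + list(frames)
    return [
        sum(w * padded[t + k] for k, w in enumerate(kernel))
        for t in range(len(frames))
    ]


def tail_init(weight_2d, kernel_size):
    """Inflate a per-frame weight into a temporal kernel: zeros, then the weight."""
    return [0.0] * (kernel_size - 1) + [weight_2d]


frames = [1.0, 2.0, 4.0, 8.0]
kernel = tail_init(0.5, kernel_size=3)

inflated_out = causal_temporal_conv(frames, kernel)
per_frame_out = [0.5 * f for f in frames]  # what the 2D layer would produce

print(inflated_out == per_frame_out)
```

With this kind of initialization the inflated network starts out behaving like the image VAE applied frame by frame, and training can then learn nonzero weights for the earlier temporal positions.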