Large Motion Video Autoencoding with Cross-modal Video VAE

Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen

TL;DR

This work introduces a cross-modal Video VAE that decouples spatial and temporal compression to reduce motion blur and temporal artifacts. It implements a two-stage spatiotemporal model: a temporal-aware spatial encoder followed by a lightweight temporal autoencoder, augmented with cross-modal text guidance and joint image-video training. The approach achieves state-of-the-art reconstruction quality across challenging benchmarks, including large-motion sequences, and enables efficient latent representations for downstream video generation. The integration of text conditioning and cross-modal training expands the model's versatility, enabling both high-fidelity video decoding and improved image compression within a unified framework.

Abstract

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.
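To make the two-stage compression described in the abstract concrete, the following minimal sketch tracks how a video's shape might change as it passes through a spatial stage and then a temporal stage. The compression factors (8x spatial, 4x temporal), the function names, and the divisibility checks are illustrative assumptions; the text above does not state the actual factors or any interface.

```python
from typing import NamedTuple


class LatentShape(NamedTuple):
    """Shape of a video or latent tensor, ignoring batch and channel axes."""
    frames: int
    height: int
    width: int


def spatial_stage(frames: int, height: int, width: int,
                  spatial_factor: int = 8) -> LatentShape:
    """Stage 1 (assumed factor): shrink height and width; keep every frame."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    return LatentShape(frames, height // spatial_factor, width // spatial_factor)


def temporal_stage(shape: LatentShape, temporal_factor: int = 4) -> LatentShape:
    """Stage 2 (assumed factor): shrink the frame axis; keep spatial size."""
    if shape.frames % temporal_factor:
        raise ValueError("frames must be divisible by temporal_factor")
    return LatentShape(shape.frames // temporal_factor, shape.height, shape.width)


def two_stage_latent_shape(frames: int, height: int, width: int,
                           spatial_factor: int = 8,
                           temporal_factor: int = 4) -> LatentShape:
    """Apply the spatial stage first, then the temporal stage."""
    after_spatial = spatial_stage(frames, height, width, spatial_factor)
    return temporal_stage(after_spatial, temporal_factor)
```

Under these assumed factors, a 16-frame 256x256 clip would map to a latent of 4 frames at 32x32. Keeping the two stages as separate functions mirrors the decoupling the abstract argues for: the spatial stage never changes the frame count, and the temporal stage never changes the spatial size.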

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our reconstruction results compared with three recent strong baselines. The ground truth frame is (0). Our model significantly outperforms previous methods, especially in large-motion scenarios such as people doing sports.
  • Figure 2: Comparison of our optimal spatiotemporal modeling and the two other options. Simultaneous modeling is achieved by inflating pre-trained 2D spatial VAE to 3D VAE. Sequential modeling indicates first compressing the spatial dimension with a spatial encoder and then compressing the temporal information with a temporal encoder. We identify the issues of these two options and propose to combine both advantages and achieve a much better video reconstruction quality. Our VAE also benefits from cross-modality, i.e., text information.
  • Figure 3: The architecture of our temporal-aware spatial autoencoder. We expand the 2D convolution of SD VAE (Rombach et al., 2022) to 3D convolution and append one additional 3D convolution as a temporal convolution after the expanded 3D convolution, which forms the STBlock3D. We also inject cross-attention layers for cross-modal learning with textual conditions.
  • Figure 4: Comparisons among simultaneous spatiotemporal modeling, sequential spatiotemporal modeling and our proposed solution.
  • Figure 5: The effectiveness of cross-modal learning for our video VAE. The introduction of textual information improves detail recovery. We visualize the learned attention map using keywords from the input prompts.
  • ...and 1 more figures
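
The captions for Figures 2 and 3 mention inflating a pre-trained 2D VAE to 3D and expanding 2D convolutions to 3D convolutions, but they do not say how the weights are initialized. As a hedged illustration only, the sketch below shows one common inflation scheme (an assumption, not confirmed by the source): the 2D kernel is copied into the central temporal slice and the remaining slices are zero, so that with stride 1 and zero padding in time the 3D convolution initially reproduces the per-frame 2D result. The function name `inflate_kernel_2d_to_3d` is hypothetical.

```python
from typing import List

Kernel2D = List[List[float]]
Kernel3D = List[Kernel2D]


def inflate_kernel_2d_to_3d(kernel_2d: Kernel2D, temporal_size: int = 3) -> Kernel3D:
    """Central inflation (assumed scheme): 2D weights go in the middle time slice.

    All other time slices are zero, so neighbouring frames contribute nothing
    until training updates those weights.
    """
    if temporal_size < 1 or temporal_size % 2 == 0:
        raise ValueError("temporal_size must be a positive odd number")
    height = len(kernel_2d)
    width = len(kernel_2d[0])
    center = temporal_size // 2
    kernel_3d: Kernel3D = []
    for t in range(temporal_size):
        if t == center:
            # Copy rows so later edits to the 3D kernel do not alias the 2D one.
            kernel_3d.append([row[:] for row in kernel_2d])
        else:
            kernel_3d.append([[0.0] * width for _ in range(height)])
    return kernel_3d
```

For example, `inflate_kernel_2d_to_3d([[1.0, 2.0], [3.0, 4.0]])` yields three 2x2 slices in which only the middle slice carries the original weights.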