Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar

TL;DR

This work presents CMD, a memory- and compute-efficient latent diffusion model for video generation that encodes each video as an image-like content frame plus a low-dimensional motion latent. The content frame distribution is modeled by fine-tuning a pretrained image diffusion model, while a lightweight diffusion model (DiT-based) generates the motion latent conditioned on the content frame and text. This two-stage design leverages rich image-domain priors to improve video quality while drastically reducing FLOPs and memory compared with prior methods, achieving strong FVD scores and fast sampling on high-resolution outputs. The method demonstrates substantial efficiency gains and competitive quality on UCF-101, WebVid-10M, and MSR-VTT, and includes comprehensive ablations and efficiency analyses to validate its components. Limitations include fixed video length and potential improvements in content/motion latent forms, with future work pointing toward longer videos and better latent encodings.

Abstract

Video diffusion models have recently made great progress in generation quality, but are still limited by high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose the content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content, and the latter represents the underlying motion in the video. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better quality generation and reduced computational costs. For instance, CMD can sample a video 7.7$\times$ faster than prior approaches by generating a video of 512$\times$1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
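The headline numbers in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the relative FVD improvement from the two reported scores and derives the sampling time that the 7.7$\times$ speedup would imply for the prior approach. The function names are illustrative, and the implied baseline time is an inference that assumes the speedup was measured in the same 512$\times$1024, 16-frame setting; the abstract does not state that time directly.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FVD."""
    return (baseline - ours) / baseline


def implied_baseline_time(our_time_s: float, speedup: float) -> float:
    """Baseline sampling time implied by a reported speedup factor."""
    return our_time_s * speedup


# Reported in the abstract: FVD 212.7 (CMD) vs. 292.4 (previous state of the art).
fvd_gain = relative_improvement(baseline=292.4, ours=212.7)
print(f"Relative FVD improvement: {fvd_gain:.1%}")

# Reported in the abstract: 3.1 s per video, 7.7x faster than prior approaches.
# Assumption: both timings refer to the same resolution and video length.
baseline_time = implied_baseline_time(our_time_s=3.1, speedup=7.7)
print(f"Implied prior sampling time: {baseline_time:.1f} s")
```

The first value agrees with the 27.3% figure quoted in the abstract. The second, roughly 24 seconds, is only what the stated ratio implies.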

Paper Structure (23 sections, 8 equations, 15 figures, 11 tables, 1 algorithm)

Figures (15)

  • Figure 1: Existing (text-to-)video diffusion models extended from image diffusion models often suffer from computation and memory inefficiency due to the extremely high dimensionality and temporal redundancy of video frames. Compared with these methods, CMD requires $\sim$16.7$\times$ less computation with only $\sim$66% of the GPU memory usage in sampling, while achieving significantly better video generation quality. FLOPs and memory consumption are measured on a single NVIDIA A100 40GB GPU when generating a single video of resolution 512$\times$1024 and length 16.
  • Table 1: Class-conditional video generation on UCF-101. # denotes that the model also uses the test split for training. $\downarrow$ indicates lower values are better. Bold indicates the best results, and we mark our method in blue. We mark a method with * if its score is evaluated with 10,000 real and generated videos; otherwise we use 2,048 videos. For a zero-shot setup, we report the dataset size used for training.
  • Figure 2: Comparison of (a) the conventional extension of image diffusion models for video generation and (b) our CMD. We mark the newly added parameters in blue. Unlike common approaches that directly add temporal layers to pretrained image diffusion models, CMD encodes each video as an image-like content frame and motion latents, and then fine-tunes a pretrained image diffusion model (e.g., Stable Diffusion; Rombach et al., 2021) for content frame generation and trains a new lightweight diffusion model (e.g., DiT; Peebles and Xie, 2022) for motion generation.
  • Figure 3: 512$\times$1024 resolution, 16-frame text-to-video generation results from our CMD. We visualize video frames with a stride of 5. We provide more examples with different text prompts in the appendix, and the corresponding video files in the supplementary material.
  • Figure 4: Illustration of our autoencoder. Encoder: We compute the relative importance of all frames (blue) to obtain a content frame and a motion latent representation. Decoder: Using the content frame and the motion latent representation, we construct a cubic tensor for a video network to reconstruct the video.
  • ...and 10 more figures
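
The encoder described in the Figure 4 caption forms a content frame from the relative importance of all frames. One way to read this is as an importance-weighted average over the time axis, sketched below in plain Python. The sketch is illustrative and not the authors' implementation: it assumes one scalar importance logit per frame and softmax normalization, neither of which is stated in the caption, and `softmax` and `content_frame` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def content_frame(frames, importance_logits):
    """Importance-weighted average of frames along the time axis.

    frames: list of T frames, each a flat list of pixel values.
    importance_logits: one scalar per frame (a simplifying assumption).
    """
    weights = softmax(importance_logits)
    num_pixels = len(frames[0])
    return [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(num_pixels)
    ]


# Toy video: 3 frames with 2 pixels each.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# Equal logits give equal weights, so the content frame is the mean frame.
uniform = content_frame(frames, [0.0, 0.0, 0.0])

# A dominant logit pulls the content frame toward that single frame.
peaked = content_frame(frames, [0.0, 0.0, 50.0])
```

With equal importance the content frame reduces to the temporal mean, and with one dominant frame it approaches that frame. In the paper's design the motion latent representation carries the temporal information needed to reconstruct the video.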