ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Zhongjie Duan, Wenmeng Zhou, Cen Chen, Yaliang Li, Weining Qian

TL;DR

ExVideo tackles the challenge of extending video diffusion models to longer sequences without prohibitive compute by introducing a post-tuning framework that mounts adapters on temporal modules. The method extends 3D convolution, temporal attention, and positional embeddings in Stable Video Diffusion, while freezing non-temporal parameters to preserve generalization. On the OpenSoraPlan dataset (~40k videos), ExVideo extends generation to 128 frames (roughly five times the baseline) with about 1.5k GPU hours, maintaining quality and diverse styles. The work demonstrates memory-efficient training through common optimizations and highlights practical pipelines by integrating text-to-image-to-video components; the code and extended model are slated for public release.

Abstract

Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, the extension of video lengths has been constrained by limited computational resources, and most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design an extension strategy for each of the common temporal architectures: 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.

Paper Structure (16 sections, 5 figures)

This paper contains 16 sections, 5 figures.

Figures (5)

  • Figure 1: The architecture of extended temporal blocks in Stable Video Diffusion. We replace the static positional embedding with trainable positional embedding and add an adaptive identity 3D convolution layer to learn long-term video features. The modifications are adaptive, preserving the original generalization abilities of the pre-trained model. All parameters outside the temporal block are fixed while training for lower memory usage.
  • Figure 2: Examples in different styles generated by our Extended Stable Video Diffusion, where the first frame is generated by Stable Diffusion 3. The prompt is "A beautiful coastal beach in spring, waves lapping on sand", followed by the description of style.
  • Figure 3: Video examples in different training phases. The first frame is generated by Hunyuan DiT, and the prompt is "sunset, mountains, clouds". We present the optical flow to visualize the motion, where pixels with similar colors are moving in similar directions.
  • Figure 4: Video examples in various resolutions. The first frame is generated by Stable Diffusion 3, and the prompt is "bonfire, on the stone".
  • Figure 5: Visual comparisons of text-to-video results from several existing video synthesis models and our extended model. The prompts are "a boat sailing smoothly on a calm lake" and "an astronaut flying in space, Van Gogh style". In our pipeline, the first frame is generated by Hunyuan DiT, and our extended Stable Video Diffusion generates the video according to the first frame.
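
The Figure 1 caption mentions an identity 3D convolution layer and fixing all parameters outside the temporal block. The sketch below illustrates both ideas in plain Python under simplifying assumptions: it uses a 1D convolution along the frame axis instead of a real 3D convolution, and the helper names (`identity_temporal_kernel`, `temporal_conv`, `trainable_names`) and the parameter names are hypothetical, not taken from the ExVideo code. The point is only that a kernel initialized to the identity leaves the pre-trained features unchanged before training starts, and that trainable parameters can be selected by name.

```python
from typing import List, Sequence


def identity_temporal_kernel(size: int) -> List[float]:
    """Return a kernel with 1.0 at the centre tap and 0.0 elsewhere."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd to have a centre tap")
    kernel = [0.0] * size
    kernel[size // 2] = 1.0
    return kernel


def temporal_conv(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded, same-length 1D cross-correlation along the frame axis."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < len(frames):  # positions outside the clip count as zero
                acc += weight * frames[src]
        out.append(acc)
    return out


def trainable_names(names: Sequence[str], marker: str = "temporal") -> List[str]:
    """Pick the parameters to train; everything else would stay frozen.

    Matching on the substring `marker` is an assumption made for this sketch.
    """
    return [name for name in names if marker in name]


# One feature value per frame (hypothetical numbers).
features = [0.5, -1.0, 2.0, 3.5, 0.0]
kernel = identity_temporal_kernel(3)
output = temporal_conv(features, kernel)

# Hypothetical parameter names, only to show the selection step.
params = [
    "unet.spatial_block.conv.weight",
    "unet.temporal_block.conv3d.weight",
    "unet.temporal_block.pos_embed",
]
selected = trainable_names(params)
```

With the identity kernel, `output` equals `features`, so inserting such a layer does not change the model's behaviour at the start of post-tuning; training can then move the kernel away from the identity to capture longer-range motion. In a real framework the frozen set would be handled by disabling gradients on the unselected parameters.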