CascadeV: An Implementation of Wurstchen Architecture for Video Generation

Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo

TL;DR

CascadeV tackles the computational bottleneck of diffusion-based video generation by cascading a base T2V model with a Latent Diffusion Model-based VAE (LDM-VAE). It introduces a grid-based 3D attention mechanism to jointly model spatial and temporal information and removes text cross-attention to focus the DiT stage on high-frequency refinement, achieving a latent compression of $32:1$ and producing 2K video outputs. The approach can be cascaded with existing T2V models to yield up to a $4\times$ improvement in resolution or FPS, enabling post-processing upgrades to current outputs. Experiments on Intern4k demonstrate competitive reconstruction quality and strong video quality metrics, validating the cascade strategy's effectiveness for efficient high-resolution T2V generation.

Abstract

Following the tremendous success of diffusion models in text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly when generating high-resolution videos at high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational cost of high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Samples of CascadeV. Top: video reconstruction (of samples from Open-Sora-Plan v1.1.0 [pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109]) with a high compression ratio. Even at a $64:1$ compression ratio, our model can still reconstruct high-frequency details. Middle: 4$\times$ resolution enhancement of Open-Sora-Plan v1.1.0 [pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109] results. Bottom: 4$\times$ FPS improvement of SVD [blattmann2023stable] results. By treating the output of existing T2V models as intermediate results, our model can enhance either their resolution or their FPS.
  • Figure 2: Overall architecture. By cascading the Latent Diffusion Model (LDM), the LDM-VAE can decode the output of the base T2V model with a higher compression ratio.
  • Figure 3: 3D attention with grid. By dividing the latent into spatiotemporal blocks in a grid manner, our model significantly reduces the computational complexity of 3D attention while effectively preserving the interaction of spatiotemporal information.
  • Figure 4: Two approaches for generating conditioning in DiT. (a) Utilizing a high-compression semantic compressor, which follows the approach of [pernias2023wurstchen], results in a relatively strong interdependence between the base T2V model and the DiT. (b) Implementing resize techniques to achieve a higher compression ratio, which decouples the base model from the DiT, theoretically offering greater scalability.
  • Figure 5: Qualitative results. Despite the higher compression ratio of our model, the reconstructed results still exhibit sufficient detail and good temporal consistency.
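
The Figure 3 caption describes restricting 3D attention to spatiotemporal blocks laid out on a grid. The following is a minimal sketch of that idea, not the authors' implementation: it only partitions a latent into non-overlapping blocks and counts query-key pairs, and it does not model the alternating scheme mentioned in the abstract. The function names `grid_partition` and `attention_pairs`, the latent shape, and the block size are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def grid_partition(x, block):
    """Split a (T, H, W, C) latent into non-overlapping spatiotemporal blocks.

    Returns an array of shape (num_blocks, tokens_per_block, C).
    """
    T, H, W, C = x.shape
    bt, bh, bw = block
    assert T % bt == 0 and H % bh == 0 and W % bw == 0, "block must tile the latent"
    # Split each axis into (block index, position inside block).
    x = x.reshape(T // bt, bt, H // bh, bh, W // bw, bw, C)
    # Move the three block-index axes to the front, within-block axes after them.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, bt * bh * bw, C)


def attention_pairs(num_blocks, tokens_per_block):
    """Query-key pairs when attention is computed independently inside each block."""
    return num_blocks * tokens_per_block ** 2


# Hypothetical latent and block sizes (assumptions, not values from the paper).
latent = np.zeros((8, 32, 32, 4), dtype=np.float32)
blocks = grid_partition(latent, block=(4, 8, 8))

full_pairs = attention_pairs(1, 8 * 32 * 32)  # one block covering every token
grid_pairs = attention_pairs(blocks.shape[0], blocks.shape[1])
print(blocks.shape, full_pairs // grid_pairs)
```

With these illustrative sizes the latent splits into 32 blocks of 256 tokens each, so block-restricted attention evaluates 32 times fewer query-key pairs than full 3D attention over all 8192 tokens. In general, splitting N tokens into B equal blocks reduces the pair count from N² to N²/B.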