CascadeV: An Implementation of Wurstchen Architecture for Video Generation

Wenfeng Lin; Jiangchuan Wei; Boyuan Liu; Yichen Zhang; Shiyue Yan; Mingyu Guo

TL;DR

CascadeV tackles the computational bottleneck of diffusion-based video generation by cascading a base T2V model with a Latent Diffusion Model-based VAE (LDM-VAE). It introduces a grid-based 3D attention mechanism that jointly models spatial and temporal information, and it removes text cross-attention so that the DiT stage focuses on high-frequency refinement. The model achieves a latent compression of $32:1$ and produces 2K video outputs. It can be cascaded with existing T2V models to yield up to a $4\times$ increase in resolution or FPS, so it can serve as a post-processing upgrade for their outputs. Experiments on Intern4k show competitive reconstruction quality and strong video quality metrics, supporting the cascade strategy as an efficient route to high-resolution T2V generation.

Abstract

With the tremendous success of diffusion models in text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly when generating high-resolution videos at high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational cost of high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information and ensures superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.
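
To give a rough sense of why a high compression ratio eases the computational load, the sketch below counts latent positions per frame for the $32:1$ figure quoted in the TL;DR. It is illustrative arithmetic only, not the authors' code. It assumes that $32:1$ applies to each spatial side, and the 2560 x 1440 frame size and the helper names `latent_hw` and `position_reduction` are hypothetical choices made for this example.

```python
import math


def latent_hw(height, width, per_side_ratio=32):
    """Latent grid size for one frame.

    ASSUMPTION: the 32:1 ratio is read as a per-side spatial factor.
    The source text does not say whether it is per side or per area.
    """
    return (math.ceil(height / per_side_ratio),
            math.ceil(width / per_side_ratio))


def position_reduction(height, width, per_side_ratio=32):
    """How many pixel positions map onto one latent position, on average."""
    latent_h, latent_w = latent_hw(height, width, per_side_ratio)
    return (height * width) / (latent_h * latent_w)


# Hypothetical "2K" frame size chosen for illustration only.
frame_h, frame_w = 1440, 2560
print(latent_hw(frame_h, frame_w))           # (45, 80)
print(position_reduction(frame_h, frame_w))  # 1024.0
```

Under this reading, each frame is represented by 45 x 80 latent positions instead of 1440 x 2560 pixels, so the diffusion stage operating in that latent space attends over roughly a thousand times fewer spatial positions per frame. If the ratio were instead meant per area, the reduction would be correspondingly smaller.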
