Training-Free Efficient Video Generation via Dynamic Token Carving

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia

TL;DR

This work tackles the heavy computational cost of video generation with Diffusion Transformers (DiTs) by introducing Jenga, a training-free inference pipeline that combines dynamic token carving with Progressive Resolution. Block-wise Attention Carving reorders tokens along a space-filling curve and sparsifies attention with block masks, while Progressive Resolution decouples content generation from detail refinement across resolution stages, aided by a Text-Attention Amplifier and case-agnostic timestep skipping. The approach yields up to $8.83\times$ speedup across multiple state-of-the-art DiT models with minimal quality loss and generalizes across models, including T2V, I2V, and distilled variants, on single- and multi-GPU setups. In practice, Jenga cuts inference time from minutes to seconds on modern hardware without retraining.

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
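The block-wise selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' kernel: mean pooling of each block, the softmax over pooled scores, the top-`keep` rule, and every name below (`block_mean`, `select_blocks`, `keep`) are assumptions made for illustration, and the token count is assumed to be divisible by the block size `m`.

```python
import math
import random

def block_mean(tokens, m):
    """Average every m consecutive token vectors into one block vector."""
    dim = len(tokens[0])
    return [
        [sum(tok[d] for tok in tokens[i:i + m]) / m for d in range(dim)]
        for i in range(0, len(tokens), m)
    ]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(q, k, m, keep):
    """Boolean block mask: mask[i][j] is True if query block i attends to key block j.

    Hypothetical rule: keep the `keep` highest-scoring key blocks per query
    block, and always keep the diagonal (self) block.
    """
    qb, kb = block_mean(q, m), block_mean(k, m)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for i, qi in enumerate(qb):
        scores = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in kb])
        top = sorted(range(len(kb)), key=lambda j: scores[j], reverse=True)[:keep]
        row = [j in top for j in range(len(kb))]
        row[i] = True
        mask.append(row)
    return mask

random.seed(0)
N, D, M = 64, 4, 8  # 64 tokens, 4 channels, 8 tokens per block -> 8 blocks
q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
k = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
mask = select_blocks(q, k, m=M, keep=2)
density = sum(map(sum, mask)) / (len(mask) ** 2)
print(f"kept {density:.0%} of block pairs")
```

Dense attention inside the kept block pairs then costs roughly `density` times the full quadratic cost, which is where the savings in the later denoising steps would come from under these assumptions.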

Paper Structure

This paper contains 50 sections, 5 equations, 18 figures, 13 tables, 4 algorithms.

Figures (18)

  • Figure 1: Jenga generates high-quality videos with an efficient DiT inference pipeline. (a): Extremely sparse attention can preserve details in generated videos. (b): We minimize token interactions via dynamic sparse attention with a progressive resolution design. We present videos generated by Jenga (sub-sampled 48 frames) among different models, marked with the DiT latency and relative speedup rate. Please use Adobe Acrobat Reader for a live video visualization.
  • Figure 2: Overview of Jenga. The left part illustrates attention carving. A 3D video latent is partitioned into local blocks before being passed to the Transformer layers. Block-wise attention is computed to obtain head-aware sparse block-selection masks. In each selected block, dense parallel attention is performed. The right part illustrates the Progressive Resolution strategy. The number of tokens and timesteps is compressed to ensure efficient generation.
  • Figure 3: Attention Carving (AttenCarve). Here we illustrate a toy example of a $4\times4\times4$ latent, where $m=8$ latent items form a block. Left: The latent 3D re-ordering and block partition via space-filling curves (SFC). Right: After the block-wise attention in \ref{eq:block_atten}, we construct the Importance Mask. Combined with the pre-computed Condition Mask and Adjacency Mask, it forms a block-wise dense attention mask that is passed to the customized kernel for device-efficient attention.
  • Figure 4: Progressive Resolution (ProRes). Left: A brief illustration of stage switch and timestep skip. Before the rescale in stage $s$, we revert the latent to a clean state $\hat{x}^{s}_0$, then re-noise the upsampled clean latent. Right & Bottom: We add a bias to the video-text attention score to enable a scalable Field of View (FOV) in low-resolution content generation.
  • Figure 5: Qualitative comparisons. (a): Jenga maintains strong semantic performance while producing high-quality videos. (b): Examples across multiple Jenga settings. We also demonstrate how the text-amplifier stabilizes Field of View (FOV) across different initial resolutions.
  • ...and 13 more figures
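
The toy setting in the Figure 3 caption (a $4\times4\times4$ latent with $m=8$ items per block) can be reproduced with a short sketch. The caption does not say which space-filling curve is used, so the Z-order (Morton) curve below is a stand-in chosen for brevity, and the helper names (`morton_key`, `sfc_blocks`, `raster_blocks`, `extent`) are hypothetical.

```python
def morton_key(t, h, w, bits):
    """Interleave the bits of (t, h, w) into a single Z-order index."""
    key = 0
    for b in range(bits):
        key |= ((w >> b) & 1) << (3 * b)
        key |= ((h >> b) & 1) << (3 * b + 1)
        key |= ((t >> b) & 1) << (3 * b + 2)
    return key

def all_coords(size):
    """All (t, h, w) positions of a size^3 latent in raster order."""
    return [(t, h, w) for t in range(size) for h in range(size) for w in range(size)]

def sfc_blocks(size=4, m=8):
    """Re-order the latent along the Z-order curve, then cut it into blocks of m items."""
    bits = (size - 1).bit_length()
    ordered = sorted(all_coords(size), key=lambda c: morton_key(*c, bits=bits))
    return [ordered[i:i + m] for i in range(0, len(ordered), m)]

def raster_blocks(size=4, m=8):
    """Baseline: cut the raster-ordered latent into blocks of m items."""
    coords = all_coords(size)
    return [coords[i:i + m] for i in range(0, len(coords), m)]

def extent(block):
    """Bounding-box side lengths of a block along (t, h, w)."""
    return tuple(
        max(c[a] for c in block) - min(c[a] for c in block) + 1 for a in range(3)
    )

blocks = sfc_blocks()
print(len(blocks), extent(blocks[0]), extent(raster_blocks()[0]))
```

With this stand-in curve, every block is a compact $2\times2\times2$ cube, whereas raster ordering yields flat $1\times2\times4$ slabs. Keeping spatio-temporal neighbors in the same block is what makes a block-level mask meaningful.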