Training-Free Efficient Video Generation via Dynamic Token Carving

Yuechen Zhang; Jinbo Xing; Bin Xia; Shaoteng Liu; Bohao Peng; Xin Tao; Pengfei Wan; Eric Lo; Jiaya Jia

TL;DR

This work tackles the heavy computational cost of video generation with Diffusion Transformers (DiTs) by introducing Jenga, a training-free inference pipeline that combines dynamic token carving with Progressive Resolution. Block-wise Attention Carving orders tokens along a space-filling curve and sparsifies attention with block masks, while Progressive Resolution decouples content generation from detail refinement by denoising in stages of increasing latent resolution, aided by a Text-Attention Amplifier and case-agnostic timestep skipping. The approach yields up to $8.83\times$ speedup across multiple state-of-the-art DiT models with minimal quality loss, and it generalizes across models, including T2V, I2V, and distilled variants, on single- and multi-GPU setups. In practice, Jenga cuts inference time for high-quality video generation from minutes to seconds on modern hardware without retraining.

Abstract

Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: https://github.com/dvlab-research/Jenga
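The block-wise selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which space-filling curve is used or how blocks are scored, so this sketch assumes a Morton (Z-order) curve as a stand-in, mean-pooled block representatives, and a top-k rule. The names `morton_order`, `block_topk_mask`, `block_size`, and `keep` are hypothetical.

```python
import numpy as np

def morton_key(t, h, w, bits=10):
    """Interleave the bits of (t, h, w) into one Z-order key (assumed stand-in curve)."""
    key = 0
    for b in range(bits):
        key |= ((t >> b) & 1) << (3 * b + 2)
        key |= ((h >> b) & 1) << (3 * b + 1)
        key |= ((w >> b) & 1) << (3 * b)
    return key

def morton_order(T, H, W):
    """Permutation of raster token indices so that 3D neighbours sit close in 1D."""
    keys = [morton_key(t, h, w)
            for t in range(T) for h in range(H) for w in range(W)]
    return np.argsort(keys, kind="stable")

def block_topk_mask(q, k, block_size, keep):
    """Boolean (num_blocks, num_blocks) mask; True means the block pair is attended.

    Assumption: each block is summarised by the mean of its tokens, and every
    query block keeps its `keep` highest-scoring key blocks plus itself.
    """
    n, d = q.shape
    assert n % block_size == 0, "token count must be divisible by block_size"
    nb = n // block_size
    assert 1 <= keep <= nb
    qb = q.reshape(nb, block_size, d).mean(axis=1)
    kb = k.reshape(nb, block_size, d).mean(axis=1)
    scores = qb @ kb.T                                   # (nb, nb) block affinities
    top = np.argpartition(-scores, keep - 1, axis=1)[:, :keep]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    mask[np.arange(nb), np.arange(nb)] = True            # always keep the local block
    return mask

# Usage: reorder a toy 4x4x4 latent, then build a sparse block mask.
rng = np.random.default_rng(0)
order = morton_order(4, 4, 4)
q = rng.standard_normal((64, 16))[order]
k = rng.standard_normal((64, 16))[order]
mask = block_topk_mask(q, k, block_size=8, keep=2)
```

With 8 blocks and `keep=2`, each query block attends to at most 3 of 8 key blocks, so dense attention would only be computed on the selected block pairs. Reordering tokens along the curve before blocking means each block covers a compact 3D neighbourhood rather than a thin raster strip.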
