Minute-Long Videos with Dual Parallelisms

Zeqing Wang; Bowen Zheng; Xingyi Yang; Zhenxiong Tan; Yuecong Xu; Xinchao Wang

TL;DR

The paper addresses the high latency and memory demands of DiT-based video diffusion models for long videos by introducing DualParal, a distributed inference strategy that parallelizes both temporal frames and model layers via a block-wise denoising scheme. Coupling a FIFO queue with a device pipeline enables asynchronous computation across GPUs, while a KV feature cache and coordinated noise initialization maintain quality and global temporal coherence without extra resource costs. Reported results include up to 6.54× lower latency and 1.48× lower memory cost for 1,025-frame videos on 8× RTX 4090 GPUs, enabling minute-long video generation. Together, these contributions offer a practical, scalable approach to long-video diffusion inference that keeps GPU idle time low.

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
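The block-wise denoising scheme described in the abstract can be illustrated with a toy scheduling simulation. The sketch below is not the authors' implementation: the function name `fifo_blockwise_schedule`, the one-step-per-tick model, and the uniform step count per block are all assumptions made for illustration, and it ignores layer partitioning, the feature cache, noise initialization, and inter-GPU communication. It only shows how a FIFO queue keeps blocks at progressively decreasing noise levels, with the oldest (least noisy) block at the head.

```python
from collections import deque


def fifo_blockwise_schedule(num_blocks: int, num_steps: int):
    """Toy model of block-wise denoising with a FIFO queue (hypothetical helper).

    Each queue entry is [block_id, remaining_denoising_steps]. On every tick,
    one fresh fully noisy block joins the tail (while any remain), every
    queued block advances by one denoising step, and fully denoised blocks
    leave from the head.

    Returns the order in which blocks finish and the number of ticks used.
    """
    queue = deque()
    finished = []
    next_block = 0
    ticks = 0

    while len(finished) < num_blocks:
        # Append a new, fully noisy block at the tail of the queue.
        if next_block < num_blocks:
            queue.append([next_block, num_steps])
            next_block += 1

        # One tick: every block currently in the queue is denoised by one step.
        # In a pipelined setting this per-block work would be spread across
        # devices; here it is simply counted as a single tick (an assumption).
        for entry in queue:
            entry[1] -= 1
        ticks += 1

        # The head holds the oldest block, so it is the first to become clean.
        while queue and queue[0][1] == 0:
            finished.append(queue.popleft()[0])

    return finished, ticks


if __name__ == "__main__":
    order, ticks = fifo_blockwise_schedule(num_blocks=5, num_steps=3)
    print("finish order:", order)
    print("ticks:", ticks)
```

In this toy model, blocks finish in the order they were enqueued, and the tick count is `num_blocks + num_steps - 1` rather than `num_blocks * num_steps`, because blocks at different noise levels advance in the same tick instead of waiting for one another.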
