Minute-Long Videos with Dual Parallelisms

Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang

TL;DR

The paper addresses the high latency and memory demands of DiT-based video diffusion models for long videos by introducing DualParal, a distributed inference strategy that parallelizes both temporal frames and model layers via a block-wise denoising scheme. Coupling a FIFO queue of frame blocks with a device pipeline enables asynchronous computation across GPUs, while a KV feature cache and coordinated noise initialization maintain quality and global temporal coherence without extra resource costs. On 8× RTX 4090 GPUs, the method generates 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost, enabling minute-long video generation. Together, these contributions offer a practical, scalable approach to long-video diffusion that keeps GPU idle time low.

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
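
The block-wise scheme described in the abstract can be illustrated with a toy queue simulation. This is a minimal sketch under simplifying assumptions, not the paper's schedule: one new block enters per round, every queued block advances by exactly one denoising step per round, and the device pipeline, KV cache, and actual tensors are not modeled. The function name `simulate_blockwise_denoising` and its parameters are hypothetical.

```python
from collections import deque


def simulate_blockwise_denoising(num_blocks, total_steps):
    """Toy FIFO simulation: blocks sit in a queue at staggered noise levels.

    Assumption (not from the paper): each round admits one new pure-noise
    block at the tail and denoises every queued block by one step.
    """
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")

    queue = deque()      # entries are [block_id, steps_done]; head is the cleanest
    finished = []        # block ids in the order they leave the queue
    next_block = 0
    rounds = 0
    max_in_flight = 0

    while len(finished) < num_blocks:
        # Admit a fresh, fully noisy block at the tail while any remain.
        if next_block < num_blocks:
            queue.append([next_block, 0])
            next_block += 1

        # Every block in the queue takes one denoising step this round.
        for entry in queue:
            entry[1] += 1
        max_in_flight = max(max_in_flight, len(queue))

        # The head block has taken the most steps; pop it once it is clean.
        if queue and queue[0][1] == total_steps:
            finished.append(queue.popleft()[0])
        rounds += 1

    return finished, rounds, max_in_flight


finished, rounds, max_in_flight = simulate_blockwise_denoising(
    num_blocks=6, total_steps=4
)
print(finished, rounds, max_in_flight)
```

In this toy setting, blocks leave the queue in the order they entered, and blocks at different noise levels are in flight at the same time, which is what lets different GPUs work on different blocks without waiting for a single synchronized noise level.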

Paper Structure

This paper contains 19 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of DualParal: DualParal partitions video frames into sequential blocks organized in a queue with noise levels increasing from tail to head, and distributes model layers across devices via a device pipeline. By feeding blocks into the pipeline in a reverse order (from tail to head), this block-wise denoising scheme significantly improves efficiency. To further improve performance, DualParal reuses Key-Value (KV) features from the previous block, requiring only the subsequent block to be concatenated. To preserve global consistency, each new block is initialized from a shared noise pool by shuffling noises, excluding the last $\frac{Num_c}{2}$ latents of the last block in queue.
  • Figure 2: Examples of four different noise initializations for the Wan2.1 model: (a) uses the complete noise space, (b) uses a subset of the noise space, (c) adds new noise to the original space, and (d) uses the complete noise space with repeated noise. The first image shows the standard video generated from the reference noise space, followed by two different orders of noise initialization.
  • Figure 3: Pipeline schedule of DualParal with $N=4$, $T=50$, and $Block_{num}=4$. Blocks are denoised in reverse order, from tail to head in the queue. After diffusion step $T$, the first clean block is popped from the queue, and all remaining blocks shift forward by one position, decrementing their indices accordingly.
  • Figure 4: Scalability analysis in terms of latency and memory cost: (a) and (b) show the scalability of Wan2.1-1.3B (480p) across different methods on a 301-frame video, while (c) and (d) present the scalability of Wan2.1-14B (720p) on a 301-frame video.
  • Figure 5: Comparison of 257-frame videos.
  • ...and 4 more figures
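
The shared-noise-pool initialization mentioned in the Figure 1 caption can be sketched as follows. This is one possible reading of the caption, not a confirmed description of the method: the pool is represented only by latent indices, and a new block draws a shuffled selection from the pool while skipping the indices used by the last $\frac{Num_c}{2}$ latents of the previous block. The function `init_block_noise_indices`, its parameters, and the pool and block sizes are all hypothetical.

```python
import random


def init_block_noise_indices(pool_size, block_len, prev_block_indices, num_c, rng):
    """Pick shuffled noise-pool indices for a new block.

    Assumed reading of the caption: indices used by the last num_c // 2
    latents of the previous block are excluded, so neighbouring latents do
    not repeat the same noise.
    """
    half = num_c // 2
    # Guard half == 0 explicitly: a slice of [-0:] would select the whole list.
    excluded = set(prev_block_indices[-half:]) if half > 0 else set()

    candidates = [i for i in range(pool_size) if i not in excluded]
    if block_len > len(candidates):
        raise ValueError("noise pool too small for the requested block length")

    rng.shuffle(candidates)          # new order of the shared noise
    return candidates[:block_len]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
first = init_block_noise_indices(16, 8, [], num_c=4, rng=rng)
second = init_block_noise_indices(16, 8, first, num_c=4, rng=rng)
print(first)
print(second)
```

Because every block draws from the same pool, all blocks share one set of initial noise patterns, and no additional noise has to be generated or transferred per block beyond the selected indices.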