PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Sijie Wang; Qiang Wang; Shaohuai Shi

TL;DR

PipeDiT tackles the latency and memory bottlenecks of diffusion-transformer video generation by integrating three system-level innovations: PipeSP for overlapped sequence parallelism, DeDiVAE to decouple diffusion from VAE decoding across two GPU groups, and Attention Co-processing to utilize idle decoding GPUs for attention computations. Together, these approaches reduce peak memory usage and significantly accelerate end-to-end inference while preserving output quality, as demonstrated on OpenSoraPlan and HunyuanVideo across multiple resolutions and timesteps. The work provides both theoretical justifications (alignment proofs and workload-balancing formulas) and extensive empirical results showing up to 4.02× speedups and notable memory efficiency gains, with robustness across hardware configurations. This approach offers a scalable path for deploying high-quality DiT-based video generation in production environments and can accommodate future model expansions such as Mixture-of-Experts architectures.

Abstract

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06x to 4.02x speedups over OpenSoraPlan and HunyuanVideo.
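To make the intuition behind decoupling the diffusion and VAE stages concrete, the following is a minimal toy latency model, not the paper's implementation. It assumes each video passes through two stages with fixed per-video times, and compares running them back to back against overlapping them on two separate GPU groups. The function names, parameters, and numbers are hypothetical and chosen only for illustration. The model also ignores the fact that splitting GPUs into two groups changes each stage's own runtime.

```python
def sequential_latency(num_videos, t_diffusion, t_decode):
    """Total time when every video runs diffusion, then decoding, with no overlap."""
    return num_videos * (t_diffusion + t_decode)


def pipelined_latency(num_videos, t_diffusion, t_decode):
    """Total time for a two-stage pipeline with constant stage times.

    While one group decodes video i, the other group already runs
    diffusion for video i+1, so after the first video the pipeline
    advances at the pace of the slower stage.
    """
    if num_videos == 0:
        return 0.0
    bottleneck = max(t_diffusion, t_decode)
    return t_diffusion + t_decode + (num_videos - 1) * bottleneck


if __name__ == "__main__":
    # Hypothetical stage times in seconds, for illustration only.
    n, t_diff, t_dec = 4, 10.0, 2.0
    print("sequential:", sequential_latency(n, t_diff, t_dec))
    print("pipelined: ", pipelined_latency(n, t_diff, t_dec))
```

In this toy setting the decoding group sits idle most of the time whenever diffusion is the slower stage, which is the kind of slack the abstract says attention co-processing is meant to use.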
