
PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

Sijie Wang, Qiang Wang, Shaohuai Shi

TL;DR

PipeDiT tackles the latency and memory bottlenecks of diffusion-transformer (DiT) video generation with three system-level techniques: PipeSP, which overlaps communication with computation in sequence parallelism; DeDiVAE, which decouples diffusion from VAE decoding across two GPU groups; and attention co-processing (Aco), which uses otherwise idle decoding GPUs for attention computation. Together, these reduce peak memory usage and accelerate end-to-end inference while preserving output quality, as demonstrated on OpenSoraPlan and HunyuanVideo across multiple resolutions and timestep counts. The work provides both theoretical justification (alignment proofs and workload-balancing formulas) and empirical results showing speedups of 1.06× to 4.02× along with memory savings, and reports robustness across hardware configurations. The approach offers a scalable path for deploying DiT-based video generation in production and can accommodate future model extensions such as Mixture-of-Experts architectures.

Abstract

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the variational autoencoder (VAE) module into two GPU groups, whose executions can also be pipelined to reduce memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves 1.06× to 4.02× speedups over OpenSoraPlan and HunyuanVideo.
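
To make the pipelining idea behind DeDiVAE concrete, the following toy latency model compares running denoising and VAE decoding back to back for each prompt against overlapping the decoding of one prompt with the denoising of the next. It is only a sketch under stated assumptions, not the paper's implementation: the function names and per-stage timings are hypothetical, stage times are treated as constant, transfer cost between the two groups is ignored, and the model does not account for the denoising group having fewer GPUs than a co-located baseline.

```python
def sequential_latency(denoise_s, decode_s, num_prompts):
    """Baseline: each prompt is fully denoised and then decoded before the next starts."""
    return num_prompts * (denoise_s + decode_s)


def pipelined_latency(denoise_s, decode_s, num_prompts):
    """Two-group pipeline: decoding prompt i can overlap with denoising prompt i+1.

    Assumes constant stage times and zero hand-off cost (illustrative only).
    """
    denoise_done = 0.0  # time at which the denoising group finishes its current prompt
    decode_done = 0.0   # time at which the decoding group finishes its current prompt
    for _ in range(num_prompts):
        denoise_done += denoise_s
        # Decoding starts once the latent is ready and the decoder is free.
        decode_done = max(decode_done, denoise_done) + decode_s
    return decode_done
```

With hypothetical timings of 10 s for denoising and 4 s for decoding, three prompts take 42 s in the sequential model and 34 s in the pipelined one; the saving grows with the number of prompts because only the last decode is left exposed.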

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Text-to-video generation starts with encoding the input text and a pure noise latent into a semantic representation, which guides a diffusion model to iteratively refine a latent. The refined latent is then upsampled by a VAE decoder to generate the final video.
  • Figure 2: Latency and peak GPU memory usage of each component during inference (using eight GPUs with SP) for a single prompt in (a) the OpenSoraPlan model with a resolution of 480×352×65 and 50 timesteps and (b) the HunyuanVideo model with a resolution of 256×128×33 and 50 timesteps.
  • Figure 3: (a) The execution process of Ulysses, where computation and communication are executed sequentially. (b) Our optimized SP (PipeSP) by pipelining communication and computation. The subsequent post-processing resolves the misalignment issue introduced by the pipelining.
  • Figure 4: In the prompt 1 stage, the Denoising GPUs transmit the computed Q, K, and V tensors to the Decoding GPUs, enabling parallel attention computation across both groups. In the prompt 2 stage, the Denoising GPUs perform attention computation independently, while the Decoding GPUs execute decoding in parallel.
  • Figure 5: The heatmap of the latency difference between the two methods: (1) PipeDiT w/o Aco and (2) PipeDiT w/ Aco.
  • ...and 1 more figure
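
The co-processing step described for Figure 4 relies on attention being separable into independent pieces that two GPU groups can compute at the same time. The sketch below illustrates one way this can work, assuming the split is made along the head dimension; that choice is an assumption for illustration, since the captions here do not state how the work is divided. Plain Python lists stand in for GPU tensors, and `co_processed` and `heads_on_denoising_group` are hypothetical names.

```python
import math


def attention_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def multi_head(qs, ks, vs):
    """Run every head in order; heads do not interact with each other."""
    return [attention_head(q, k, v) for q, k, v in zip(qs, ks, vs)]


def co_processed(qs, ks, vs, heads_on_denoising_group):
    """Hypothetical split: the first heads stay local, the rest go to the other group."""
    n = heads_on_denoising_group
    local = multi_head(qs[:n], ks[:n], vs[:n])   # stands in for the denoising GPUs
    remote = multi_head(qs[n:], ks[n:], vs[n:])  # stands in for the decoding GPUs
    return local + remote
```

Because heads do not interact, the concatenated result matches the single-group result for any split point, so in this simplified picture the split point is purely a load-balancing decision.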