Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism

Zhiyuan Wu, Shuai Wang, Li Chen, Kaihui Gao, Dan Li, Yanyu Ren, Qiming Zhang, Yong Wang

TL;DR

Video diffusion models incur prohibitive GPU memory costs because they compute attention over the 3D spatio-temporal domain, which necessitates multi-GPU serving. Latent Parallelism (LP) shifts parallelism from the model to the latent space: it rotates the partitioning dimension (temporal, height, width) across diffusion timesteps, creating parallel sub-problems that exchange only lightweight latents. Two quality-preserving mechanisms, patch-aligned overlapping partitions and position-aware latent reconstruction, maintain global video coherence during stitching. Theoretical analysis proves that LP is 2-complete and quantifies the reduction in communication overhead. Experiments on EvalCrafter, T2V-CompBench, and VBench show up to 97% lower communication overhead with comparable video quality. LP can also be integrated as a plug-in with conventional parallelism strategies for scalable VDM serving.

Abstract

Video diffusion models (VDMs) perform attention computation over the 3D spatio-temporal domain. Compared to large language models (LLMs) processing 1D sequences, their memory consumption scales cubically, necessitating parallel serving across multiple GPUs. Traditional parallelism strategies partition the computational graph, requiring frequent high-dimensional activation transfers that create severe communication bottlenecks. To tackle this issue, we exploit the local spatio-temporal dependencies inherent in the diffusion denoising process and propose Latent Parallelism (LP), the first parallelism strategy tailored for VDM serving. LP decomposes the global denoising problem into parallelizable sub-problems by dynamically rotating the partitioning dimensions (temporal, height, and width) within the compact latent space across diffusion timesteps, substantially reducing the communication overhead compared to prevailing parallelism strategies. To ensure generation quality, we design a patch-aligned overlapping partition strategy that matches partition boundaries with visual patches and a position-aware latent reconstruction mechanism for smooth stitching. Experiments on three benchmarks demonstrate that LP reduces communication overhead by up to 97% over baseline methods while maintaining comparable generation quality. As a non-intrusive plug-in paradigm, LP can be seamlessly integrated with existing parallelism strategies, enabling efficient and scalable video generation services.

Paper Structure

This paper contains 35 sections, 1 theorem, 53 equations, 13 figures, 3 tables.

Key Result

Theorem 1

Latent Parallelism is a 2-complete parallelism strategy. Formally, for any position $p$ in the latent space $Z$, we have $\mathcal{R}(p, 2) = Z$.
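The intuition behind Theorem 1 can be checked on a toy grid. The sketch below is not the paper's proof, and it rests on assumptions this summary does not confirm: that $\mathcal{R}(p, N)$ is the set of latent positions able to influence $p$ after $N$ consecutive timesteps, that every position inside a partition attends to every other position in that partition, and that partitions do not overlap. The helpers `slab_ids` and `receptive_field` are hypothetical names introduced for illustration.

```python
import numpy as np

def slab_ids(shape, axis, num_parts):
    """Label every latent position with the index of the partition it
    falls into when the latent is split into slabs along `axis`."""
    size = shape[axis]
    bounds = np.linspace(0, size, num_parts + 1).astype(int)
    ids_1d = np.searchsorted(bounds, np.arange(size), side="right") - 1
    view = [1] * len(shape)
    view[axis] = size
    return np.broadcast_to(ids_1d.reshape(view), shape)

def receptive_field(shape, p, axes, num_parts):
    """Boolean mask of positions that can influence `p` after applying one
    partitioned step per entry of `axes` (assumed semantics, see lead-in)."""
    reach = np.zeros(shape, dtype=bool)
    reach[p] = True
    # Walk backwards in time: the last step mixes p with its own slab,
    # the step before mixes each of those positions with their slabs.
    for axis in reversed(axes):
        ids = slab_ids(shape, axis, num_parts)
        touched = np.unique(ids[reach])
        reach = np.isin(ids, touched)
    return reach

if __name__ == "__main__":
    shape = (4, 6, 8)  # toy (temporal, height, width) latent
    p = (0, 0, 0)
    print(receptive_field(shape, p, [0], 2).all())     # one step
    print(receptive_field(shape, p, [0, 0], 2).all())  # same axis twice
    print(receptive_field(shape, p, [0, 1], 2).all())  # rotated axes
```

Under these assumptions, a single step or two steps along the same axis leave part of the grid unreachable, whereas two steps along different axes cover all of $Z$: a slab along one axis spans the full extent of the other two, so the second slab passes through every slab of the first step.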

Figures (13)

  • Figure 1: Comparison of VDM and LLM in terms of GPU memory consumption.
  • Figure 2: Comparison of different parallelism strategies.
  • Figure 3: Workflow of LP. At each denoising timestep $t$, the global latent is decomposed through (1) dynamic rotating partition, processed via (2) parallel denoising on independent GPUs, and finally unified by (3) latent reconstruction.
  • Figure 4: Comparison of video generation quality across three benchmarks.
  • Figure 5: Visual comparison between Centralized and LP-generated videos. The left, middle, and right columns show the starting, middle, and ending frames of the videos, respectively.
  • ...and 8 more figures
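The three stages named in the Figure 3 caption can be sketched on a toy latent. This is a minimal illustration under stated assumptions, not the authors' implementation: the latent is a plain `(T, H, W)` array, one patch size is shared by all three axes, the sub-latents are processed in a Python loop rather than on separate GPUs, and the linear taper used for blending is a stand-in for the paper's position-aware reconstruction, whose exact form is not given here. All function names and parameters are hypothetical.

```python
import numpy as np

def partition_bounds(size, num_parts, overlap, patch):
    """(start, stop) ranges along one axis. Cuts are snapped to multiples of
    `patch`, then widened by `overlap` patches on each interior side."""
    assert size % patch == 0, "axis length must be a multiple of the patch size"
    cuts = np.linspace(0, size // patch, num_parts + 1).astype(int) * patch
    return [(int(max(0, cuts[i] - overlap * patch)),
             int(min(size, cuts[i + 1] + overlap * patch)))
            for i in range(num_parts)]

def blend_weights(start, stop, size):
    """Assumed weighting: grows with distance from an interior cut, so the
    centre of a partition counts more than its overlapping border."""
    length = stop - start
    idx = np.arange(length)
    left = idx + 1 if start > 0 else np.full(length, length)
    right = length - idx if stop < size else np.full(length, length)
    return np.minimum(left, right).astype(float)

def lp_step(latent, step, num_parts, overlap=1, patch=2, denoise=lambda x: x):
    """One timestep: rotating partition, per-partition denoise, reconstruction."""
    axis = step % 3                      # rotate over T, H, W
    size = latent.shape[axis]
    out = np.zeros(latent.shape, dtype=float)
    norm = np.zeros(size)
    for start, stop in partition_bounds(size, num_parts, overlap, patch):
        sl = [slice(None)] * latent.ndim
        sl[axis] = slice(start, stop)
        sub = denoise(latent[tuple(sl)])  # would run on its own GPU
        view = [1] * latent.ndim
        view[axis] = stop - start
        w = blend_weights(start, stop, size)
        out[tuple(sl)] += sub * w.reshape(view)
        norm[start:stop] += w
    view = [1] * latent.ndim
    view[axis] = size
    return out / norm.reshape(view)      # weighted average over overlaps

if __name__ == "__main__":
    z = np.random.default_rng(0).normal(size=(4, 8, 8))
    for t in range(3):
        z = lp_step(z, t, num_parts=2)
    print(z.shape)
```

With the default identity `denoise`, each call returns its input unchanged (up to floating-point error), which is a convenient sanity check that the blending weights are normalised correctly before a real denoiser is substituted.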

Theorems & Definitions (5)

  • Definition 1: Receptive Field
  • Definition 2: Completeness
  • Definition 3: $N$-Completeness
  • Theorem 1: 2-Completeness of LP
  • proof