Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism

Zhiyuan Wu; Shuai Wang; Li Chen; Kaihui Gao; Dan Li; Yanyu Ren; Qiming Zhang; Yong Wang

TL;DR

Video diffusion models incur prohibitive GPU memory costs because of 3D spatio-temporal attention, which makes multi-GPU serving necessary. Latent Parallelism (LP) shifts parallelism from the model to the latent space: it dynamically rotates the partitioning dimension (temporal, height, or width) to create parallel sub-problems that require only lightweight latent transfers. Two quality-preserving mechanisms, patch-aligned overlapping partitions and position-aware latent reconstruction, maintain global video coherence during stitching. Theoretical analysis proves the 2-completeness of LP and quantifies its reduction in communication overhead. Experiments on EvalCrafter, T2V-CompBench, and VBench show up to 97% overhead reduction with comparable video quality. LP can also be integrated as a plug-in with conventional parallelism for scalable VDM serving.

Abstract

Video diffusion models (VDMs) perform attention computation over the 3D spatio-temporal domain. Compared to large language models (LLMs) processing 1D sequences, their memory consumption scales cubically, necessitating parallel serving across multiple GPUs. Traditional parallelism strategies partition the computational graph, requiring frequent high-dimensional activation transfers that create severe communication bottlenecks. To tackle this issue, we exploit the local spatio-temporal dependencies inherent in the diffusion denoising process and propose Latent Parallelism (LP), the first parallelism strategy tailored for VDM serving. LP decomposes the global denoising problem into parallelizable sub-problems by dynamically rotating the partitioning dimensions (temporal, height, and width) within the compact latent space across diffusion timesteps, substantially reducing the communication overhead compared to prevailing parallelism strategies. To ensure generation quality, we design a patch-aligned overlapping partition strategy that matches partition boundaries with visual patches and a position-aware latent reconstruction mechanism for smooth stitching. Experiments on three benchmarks demonstrate that LP reduces communication overhead by up to 97% over baseline methods while maintaining comparable generation quality. As a non-intrusive plug-in paradigm, LP can be seamlessly integrated with existing parallelism strategies, enabling efficient and scalable video generation services.
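
To make the three ingredients named in the abstract concrete, here is a minimal pure-Python sketch along a single latent axis. It is an illustration under stated assumptions, not the authors' implementation: the round-robin rotation order, the function names (`partition_dim`, `overlapping_slices`, `blend_weights`, `stitch`), the `overlap_patches` parameter, and the linear-ramp blending weights are all hypothetical choices made here for clarity.

```python
from typing import List, Tuple

DIMS = ("temporal", "height", "width")


def partition_dim(step: int) -> str:
    """Pick the axis to partition at a diffusion timestep.
    Assumption: a simple round-robin rotation over the three axes."""
    return DIMS[step % len(DIMS)]


def overlapping_slices(length: int, parts: int, patch: int,
                       overlap_patches: int) -> List[Tuple[int, int]]:
    """Split [0, length) into `parts` ranges whose boundaries fall on patch
    multiples, each extended by `overlap_patches` patches into its neighbours."""
    if length % patch != 0:
        raise ValueError("length must be a multiple of the patch size")
    n_patches = length // patch
    if parts > n_patches:
        raise ValueError("more partitions than patches")
    slices = []
    for i in range(parts):
        start_p = max(0, i * n_patches // parts - overlap_patches)
        end_p = min(n_patches, (i + 1) * n_patches // parts + overlap_patches)
        slices.append((start_p * patch, end_p * patch))
    return slices


def blend_weights(slices: List[Tuple[int, int]], length: int) -> List[List[float]]:
    """Per-position weight of each partition, normalised to sum to 1.
    Assumption: weight ramps down linearly toward an interior partition edge."""
    raw = []
    for start, end in slices:
        w = [0.0] * length
        for x in range(start, end):
            # Edges that coincide with the latent border do not reduce the weight.
            left = x - start + 1 if start > 0 else length
            right = end - x if end < length else length
            w[x] = float(min(left, right))
        raw.append(w)
    totals = [sum(w[x] for w in raw) for x in range(length)]
    return [[w[x] / totals[x] for x in range(length)] for w in raw]


def stitch(pieces: List[List[float]], slices: List[Tuple[int, int]],
           length: int) -> List[float]:
    """Recombine per-GPU results into one axis using position-dependent weights."""
    weights = blend_weights(slices, length)
    out = [0.0] * length
    for piece, (start, end), w in zip(pieces, slices, weights):
        for k, x in enumerate(range(start, end)):
            out[x] += w[x] * piece[k]
    return out


if __name__ == "__main__":
    slices = overlapping_slices(length=16, parts=2, patch=2, overlap_patches=1)
    signal = [float(i) for i in range(16)]
    pieces = [signal[s:e] for s, e in slices]  # stand-in for per-GPU denoising
    print(partition_dim(0), slices, stitch(pieces, slices, 16) == signal)
```

In this sketch each worker would denoise only its own slice, and only the compact latent pieces are exchanged before stitching. If every worker returns the same values in the overlap, the stitched result equals the original signal, because the blending weights sum to one at every position.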
