FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

Guandong Li

TL;DR

FastUSP targets kernel launch overhead in distributed diffusion model inference with a multi-level optimization framework. Compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering) is the primary contributor to end-to-end speedups: about $1.12\times$–$1.16\times$ on FLUX and $1.09\times$ on Qwen-Image with 2 GPUs. FP8 quantized communication and pipelined Ring attention add further gains in bandwidth-limited or long-sequence scenarios. A systematic performance analysis shows that attention communication is a small fraction of per-step latency on high-bandwidth interconnects, which explains why kernel-launch optimization yields the largest returns. The work also identifies compiler compatibility gaps (notably PyTorch Inductor with Ring attention) that limit compile-level gains on some configurations, and it suggests directions for compiler-runtime co-design to unlock broader speedups in distributed diffusion inference.

Abstract

Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies, including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose FastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves a consistent 1.12$\times$--1.16$\times$ end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves a 1.09$\times$ speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.

Paper Structure (24 sections, 2 equations, 5 figures, 6 tables, 2 algorithms)

Introduction
Background and Related Work
Diffusion Model Inference
Attention Parallelism
Ulysses: Head-Parallel Attention
Ring Attention: Sequence-Parallel Attention
USP: Unified Sequence Parallelism
Communication Complexity Comparison
Related Work
FastUSP Design
Overview
Compile-Level Optimization (Primary)
Graph Compilation with CUDA Graphs
Computation-Communication Reordering
Communication-Level Optimization
...and 9 more sections

Figures (5)

Figure 1: End-to-end performance on FLUX. (a) Inference time comparison between Baseline USP and FastUSP across 2, 4, and 8 GPUs. (b) FastUSP speedup over baseline, showing consistent 1.12$\times$--1.16$\times$ improvement.
Figure 2: Cross-model evaluation. (a) FastUSP speedup on all compile-compatible configurations (FLUX 2/4/8 GPU and Qwen-Image 2 GPU). (b) Qwen-Image scaling with baseline USP, showing sub-linear scaling due to memory constraints and communication overhead.
Figure 3: Per-step denoising latency. (a) FLUX: FastUSP reduces per-step latency from 288ms to 249ms (2 GPU) and from 174ms to 156ms (8 GPU). (b) Qwen-Image: baseline USP per-step latency across GPU configurations, with the FastUSP 2-GPU result shown.
Figure 4: Micro-benchmark: single attention operation latency (2 GPU, seq_len=2048). Pipelined Ring achieves 1.25$\times$ speedup; adding FP8 yields 1.27$\times$. These gains are significant at the operator level but translate to $<$1% end-to-end improvement.
Figure 5: Multi-GPU scaling efficiency. Both FLUX and Qwen-Image exhibit sub-linear scaling relative to ideal, with Qwen-Image showing greater scaling loss due to its larger memory footprint and higher communication overhead.
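
The numbers quoted in the captions of Figures 3 and 4 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the bound on the attention operation's share of a step rests on two assumptions that the captions do not state, namely that Amdahl's law applies with the attention operation as the only accelerated part, and that "<1% end-to-end improvement" means an end-to-end speedup below 1.01$\times$.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def max_accelerated_fraction(op_speedup: float, e2e_speedup_cap: float) -> float:
    """Largest fraction f of step time the accelerated operation can occupy.

    Assumes Amdahl's law, e2e = 1 / ((1 - f) + f / op_speedup),
    solved for f at e2e = e2e_speedup_cap.
    """
    return (1.0 - 1.0 / e2e_speedup_cap) / (1.0 - 1.0 / op_speedup)


# Per-step FLUX latencies from the Figure 3 caption (milliseconds).
flux_2gpu = speedup(288, 249)
flux_8gpu = speedup(174, 156)

# Figure 4 caption: 1.25x at the operator level, under 1% end to end
# (interpreted here as an end-to-end speedup below 1.01x; an assumption).
attn_share_bound = max_accelerated_fraction(1.25, 1.01)

print(f"FLUX 2 GPU per-step speedup: {flux_2gpu:.2f}x")
print(f"FLUX 8 GPU per-step speedup: {flux_8gpu:.2f}x")
print(f"Implied attention-op share of a step: below {attn_share_bound:.1%}")
```

Under these assumptions, the per-step ratios round to 1.16$\times$ (2 GPU) and 1.12$\times$ (8 GPU), matching the reported end-to-end range, and the accelerated attention operation would account for less than roughly 5% of a step. That is consistent with the claim that attention communication is a small fraction of per-step latency.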
