FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

Guandong Li

TL;DR

FastUSP targets kernel launch overhead, the main bottleneck in distributed diffusion model inference, with a multi-level optimization framework. Compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering) is the primary contributor to end-to-end speedups: about $1.12$–$1.16\times$ on FLUX and $1.09\times$ on Qwen-Image (2 GPUs). FP8 quantized communication and pipelined Ring attention add further gains in bandwidth-limited or long-sequence scenarios. A systematic performance analysis shows that attention communication is a small fraction of per-step latency on high-bandwidth interconnects, which explains why kernel-launch optimization yields the largest returns. The work also identifies compiler compatibility gaps (notably with PyTorch Inductor on Ring attention) that limit compile-level gains on some configurations, and it suggests directions for compiler-runtime co-design to unlock broader speedups in distributed diffusion inference.

Abstract

Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies, including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose FastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves a consistent $1.12\times$–$1.16\times$ end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves a $1.09\times$ speedup on 2 GPUs; on 4–8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to $1.30\times$–$1.46\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.
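The claim that communication is not the dominant cost can be made concrete with Amdahl's law. The sketch below is illustrative only: the 1.27$\times$ operator-level speedup is the value reported in the paper's attention micro-benchmark (Figure 4), while the time fractions are hypothetical inputs, not measurements from the paper, and the function names are ours.

```python
def end_to_end_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `fraction` of the runtime
    is accelerated by a factor of `local_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def max_fraction_for_gain(gain: float, local_speedup: float) -> float:
    """Largest accelerated fraction for which the overall speedup is
    still at most `gain` (Amdahl's law solved for the fraction)."""
    return (1.0 - 1.0 / gain) / (1.0 - 1.0 / local_speedup)


if __name__ == "__main__":
    operator_speedup = 1.27  # pipelined Ring + FP8, per Figure 4

    # Hypothetical shares of per-step time spent in the accelerated op.
    for fraction in (0.02, 0.05, 0.10, 0.50):
        overall = end_to_end_speedup(fraction, operator_speedup)
        print(f"fraction={fraction:.2f} -> end-to-end {overall:.3f}x")

    # Share of per-step time at which the end-to-end gain reaches 1%.
    limit = max_fraction_for_gain(1.01, operator_speedup)
    print(f"1% end-to-end gain is reached at fraction={limit:.3f}")
```

Under this simple model, a 1.27$\times$ operator speedup stays below a 1% end-to-end gain only while the accelerated operation accounts for less than about 4.7% of per-step time, which is consistent with the paper's statement that attention communication is a small fraction of per-step latency.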

Paper Structure (24 sections, 2 equations, 5 figures, 6 tables, 2 algorithms)

Figures (5)

  • Figure 1: End-to-end performance on FLUX. (a) Inference time comparison between Baseline USP and FastUSP across 2, 4, and 8 GPUs. (b) FastUSP speedup over baseline, showing consistent 1.12$\times$--1.16$\times$ improvement.
  • Figure 2: Cross-model evaluation. (a) FastUSP speedup on all compile-compatible configurations (FLUX 2/4/8 GPU and Qwen-Image 2 GPU). (b) Qwen-Image scaling with baseline USP, showing sub-linear scaling due to memory constraints and communication overhead.
  • Figure 3: Per-step denoising latency. (a) FLUX: FastUSP reduces per-step latency from 288 ms to 249 ms (2 GPUs) and from 174 ms to 156 ms (8 GPUs). (b) Qwen-Image: baseline USP per-step latency across GPU configurations, with the FastUSP 2-GPU result shown.
  • Figure 4: Micro-benchmark: single attention operation latency (2 GPU, seq_len=2048). Pipelined Ring achieves 1.25$\times$ speedup; adding FP8 yields 1.27$\times$. These gains are significant at the operator level but translate to $<$1% end-to-end improvement.
  • Figure 5: Multi-GPU scaling efficiency. Both FLUX and Qwen-Image exhibit sub-linear scaling relative to ideal, with Qwen-Image showing greater scaling loss due to its larger memory footprint and higher communication overhead.
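
As a quick consistency check, the per-step latencies quoted in the Figure 3 caption can be converted to speedup ratios and compared with the end-to-end range in Figure 1. This is a minimal sketch: the latency values come from the caption, while the dictionary and function names are ours, and per-step ratios need not match end-to-end ratios exactly.

```python
# FLUX per-step denoising latency in milliseconds, from the Figure 3 caption:
# GPU count -> (baseline USP, FastUSP).
FLUX_PER_STEP_MS = {
    2: (288.0, 249.0),
    8: (174.0, 156.0),
}


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Speedup of the optimized run relative to the baseline."""
    if baseline_ms <= 0.0 or optimized_ms <= 0.0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


if __name__ == "__main__":
    for gpus, (baseline_ms, fast_ms) in sorted(FLUX_PER_STEP_MS.items()):
        ratio = speedup(baseline_ms, fast_ms)
        saved_ms = baseline_ms - fast_ms
        print(f"{gpus} GPUs: {ratio:.2f}x per step ({saved_ms:.0f} ms saved)")
```

The two ratios round to 1.16$\times$ (2 GPUs) and 1.12$\times$ (8 GPUs), matching the endpoints of the 1.12$\times$–1.16$\times$ range reported for FLUX.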