SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Dakai An, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang

TL;DR

SwiftDiffusion tackles the latency and throughput penalties of production text-to-image serving when incorporating ControlNet and LoRA add-ons. It introduces three innovations: ControlNet-as-a-Service to cache and parallelize ControlNets on dedicated GPUs, bounded asynchronous LoRA loading (BAL) to overlap LoRA loading with base-model execution, and latent parallelism with kernel-level optimizations to accelerate base diffusion inference. Together, these designs achieve up to 7.8x reductions in latency and 1.6x improvements in throughput on SDXL models without sacrificing image quality, and they generalize to DiT-based diffusion models. The work provides a production-oriented characterization of add-on bottlenecks and demonstrates practical, scalable improvements suitable for large-scale T2I services.

Abstract

Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform the state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without sacrificing image quality.
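
The bounded overlap described for BAL can be illustrated with a small scheduling sketch. This is not the paper's implementation: `load_lora`, `run_denoising`, the sleep-based timings, and the rule of applying the LoRA as soon as it is available are all assumptions made for illustration. The only property taken from the abstract is that loading may overlap with at most the first k base-model steps.

```python
import threading
import time


def load_lora(name, delay_s):
    """Hypothetical stand-in for fetching LoRA weights from remote storage."""
    time.sleep(delay_s)
    return {"name": name}


def run_denoising(num_steps, k, lora_name, load_delay_s, step_time_s=0.01):
    """Run num_steps denoising steps while a LoRA loads in the background.

    Steps before index k may run without the LoRA; from step k onward the
    loop blocks until the load has finished, so the overlap is bounded by k.
    Returns a list of booleans: whether each step ran with the LoRA applied.
    """
    result = {}
    loaded = threading.Event()

    def _loader():
        result["lora"] = load_lora(lora_name, load_delay_s)
        loaded.set()

    threading.Thread(target=_loader, daemon=True).start()

    trace = []
    lora_applied = False
    for step in range(num_steps):
        if not lora_applied:
            if step >= k:
                loaded.wait()  # bound reached: stop overlapping and wait
            if loaded.is_set():
                lora_applied = True  # a real system would patch weights here
        time.sleep(step_time_s)  # stand-in for one base-model step
        trace.append(lora_applied)
    return trace


# Loading (0.5 s) is much slower than three steps (about 0.03 s), so the
# first k = 3 steps run on the base model alone and the rest use the LoRA.
trace = run_denoising(num_steps=10, k=3, lora_name="papercut", load_delay_s=0.5)
```

In this toy setup, a larger k hides more of the loading time behind base-model execution but lets more steps run without the LoRA; the abstract states that the overlap can extend up to k steps without compromising image quality.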

Paper Structure

This paper contains 23 sections, 2 equations, 16 figures, and 6 tables.

Figures (16)

  • Figure 1: Effects of ControlNet and LoRA in image generation with SDXL under the same prompt: "racing game car, yellow Ferrari". Left: without ControlNet, the generated images can have different compositions. Center: ControlNet uses a reference image to control the composition. Right: LoRA is used to generate an image in a papercut style.
  • Figure 2: ControlNets and LoRAs introduce additional latency overhead. In each workflow, a common base SDXL [podell2024sdxl] model is augmented with $m$ ControlNets and $n$ LoRAs ($m$C/$n$L), served by Diffusers [diffusers] on an H800 GPU.
  • Figure 3: A workflow of text-to-image with a stable diffusion model. Time embedding is ignored for simplicity.
  • Figure 4: Left: ControlNet has a small population and exhibits a skewed popularity; the long tail of the graph is truncated for a better presentation. Right: LoRA has a large quantity and exhibits a long-tailed distribution in popularity.
  • Figure 5: ControlNet loading overhead can be alleviated using a larger LRU cache, while LoRA performance gains are less pronounced. Left: Service A; Right: Service B.
  • ...and 11 more figures
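
The captions of Figures 4 and 5 relate add-on popularity to LRU cache effectiveness. The sketch below simulates that relationship on synthetic data only: the population sizes, the Zipf-like `1/rank` weights, the cache capacity, and the helper names `lru_hit_rate` and `zipf_requests` are all assumptions for illustration and are not taken from the paper's production traces.

```python
import random
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay a request sequence through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests)


def zipf_requests(population, n, seed):
    """Draw n requests over `population` items with weights proportional to 1/rank."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(population)]
    return rng.choices(range(population), weights=weights, k=n)


# Synthetic stand-ins: a small, skewed population (ControlNet-like) and a
# large, long-tailed one (LoRA-like), each served by a 10-entry cache.
small_pop_rate = lru_hit_rate(zipf_requests(20, 5000, seed=0), capacity=10)
large_pop_rate = lru_hit_rate(zipf_requests(5000, 5000, seed=0), capacity=10)
print(f"small population: {small_pop_rate:.2f}, large population: {large_pop_rate:.2f}")
```

With these synthetic settings, the same cache serves a much larger share of requests for the small population than for the large one, which is the qualitative pattern the Figure 5 caption describes.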