PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving

Desen Sun, Zepeng Zhao, Yuke Wang

TL;DR

PatchedServe addresses the bottlenecks of serving diffusion-based T2I models under mixed-resolution requests by introducing a patch-based inference framework. It deploys a Compressed Sparse Patch (CSP) representation, a boundary-aware Patch Edge Stitcher, and a patch-level caching policy, all guided by an SLO-aware scheduler with an online latency predictor. Empirical results show a substantial improvement in SLO satisfaction (up to $30.1\%$ over the state-of-the-art) while preserving image quality, and the approach scales effectively from 2 to 8 GPUs. The work demonstrates that patch-level locality and smart scheduling can unlock high parallelism for heterogeneous inputs, offering practical benefits for real-world diffusion-serving systems.

Abstract

The Text-to-Image (T2I) diffusion model has emerged as one of the most widely adopted generative models. However, serving diffusion models at the granularity of entire images introduces significant challenges, particularly under multi-resolution workloads. First, image-level serving obstructs batching across requests. Second, heterogeneous resolutions exhibit distinct locality characteristics, making it difficult to apply a uniform cache policy effectively. To address these challenges, we present PatchedServe, a Patch Management Framework for SLO-Optimized Hybrid-Resolution Diffusion Serving. PatchedServe is the first SLO-optimized T2I diffusion serving framework designed to handle heterogeneous resolutions. Specifically, it incorporates a novel patch-based processing workflow that substantially improves throughput for hybrid-resolution inputs. Moreover, PatchedServe devises a patch-level cache reuse policy to fully exploit diffusion redundancies and integrates an SLO-aware scheduling algorithm with lightweight online latency prediction to improve responsiveness. Our evaluation demonstrates that PatchedServe achieves 30.1% higher SLO satisfaction than the state-of-the-art diffusion serving system, while preserving image quality.

Paper Structure

This paper contains 21 sections, 1 equation, 19 figures, 2 tables, 1 algorithm.

Figures (19)

  • Figure 1: Assume three requests, Req1, Req2, and Req3, each requiring N steps of processing, from St N to St 0. (a) Requests are processed sequentially. (b) Requests are processed in parallel, achieving higher GPU utilization.
  • Figure 2: Overview of PatchedServe.
  • Figure 3: Latent Diffusion Model structure. The two main backbone types in diffusion models are U-Net and Diffusion Transformer (DiT).
  • Figure 4: Two T2I diffusion optimization techniques. (a) DistriFusion splits the image into multiple patches and dispatches them to different GPUs. (b) Block Caching exploits locality by reusing a block's output from the previous step and skipping the corresponding block in the current step.
  • Figure 5: Average savings from skipped computations.
  • ...and 14 more figures