OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving

Siyu Wu, Zihan Tang, Yuting Zeng, Hui Chen, Guiguang Ding, Tongxuan Liu, Ke Zhang, Hailong Yang

TL;DR

Co-locating online and offline LLM workloads on a Prefill/Decode (P/D) disaggregated system creates P/D load imbalance under bursty online traffic. OOCO introduces latency-constraint disaggregation, which splits the cluster into latency-strict and latency-relaxed pools, together with a Roofline-guided bottleneck-aware scheduler and fast preemption/migration mechanisms that preserve online SLOs while boosting offline throughput. On real traces, the approach improves offline throughput by up to 3x while maintaining online time-to-first-token (TTFT) and time-per-output-token (TPOT). It is implemented on the xLLM engine with demonstrated portability. This work improves resource utilization and cost efficiency for mixed online-offline LLM serving and can be ported to other high-performance inference frameworks.

Abstract

Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services. We propose a latency-constraint disaggregated architecture, which separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-aware scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests. Experiments on real-world traces show that, compared to existing offline system approaches, our method improves offline throughput by up to 3x while maintaining online request SLOs.

Paper Structure

This paper contains 30 sections, 1 equation, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: Visualization of request traffic variation patterns, showing tide-like fluctuations at hourly and daily scales, as well as bursty spikes at minute scales, across three datasets: the online service part of our company’s OOC trace, Azure LLM Inference Trace 2024 (Conversation), and Azure LLM Inference Trace 2024 (Code).
  • Figure 2: The computation patterns of the main time-consuming operators in LLMs. *In the figure, non-matrix-multiplication operations such as Softmax in the Attention operator are omitted. Additionally, for the commonly used Flash Attention operator [dao2022flashattention, dao2023flashattention2, hong2023flashdecoding++], the intermediate score matrix is passed through on-chip buffers or cache, without generating memory accesses to the GPU / NPU memory.
  • Figure 3: Roofline analysis with corresponding latency of LLM inference (Qwen2.5 7B on Ascend 910c NPU). Each point denotes a Prefill or Decode execution under a given batch size and request length.
  • Figure 4: Overview of OOCO.
  • Figure 5: System architecture of OOCO. Components marked with * are supported by xLLM but are disabled in our implementation since they are orthogonal to this work (see the Discussion section for more details).
  • ...and 1 more figures
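
The Roofline analysis referenced in Figure 3 rests on a simple idea: an operation's latency is bounded below by the larger of its compute time and its memory-transfer time. The sketch below illustrates that idea only; it is not the paper's performance model. The `Hardware` numbers are hypothetical placeholders (not Ascend 910C specifications), and `linear_layer_cost` is an assumed simplification that counts only weight traffic and ignores activations and the KV cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator description (illustrative values only)."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the bottleneck
        # switches from memory bandwidth to compute.
        return self.peak_flops / self.mem_bandwidth


def roofline(flops: float, bytes_moved: float, hw: Hardware) -> tuple[float, str]:
    """Return (lower-bound latency in seconds, bottleneck name)."""
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.mem_bandwidth
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


def linear_layer_cost(tokens: int, params: int, bytes_per_param: int = 2) -> tuple[float, float]:
    """Assumed cost model: 2 FLOPs per parameter per token,
    and every weight read from memory once per step."""
    flops = 2.0 * tokens * params
    bytes_moved = float(params * bytes_per_param)
    return flops, bytes_moved


if __name__ == "__main__":
    hw = Hardware(peak_flops=100e12, mem_bandwidth=1e12)  # hypothetical device
    params = 7_000_000_000  # roughly a 7B-parameter model

    # Decode step: one new token for each of 8 requests in the batch.
    decode = roofline(*linear_layer_cost(tokens=8, params=params), hw)
    # Prefill step: a single 2048-token prompt processed at once.
    prefill = roofline(*linear_layer_cost(tokens=2048, params=params), hw)

    print("ridge point (FLOP/byte):", hw.ridge_point)
    print("decode :", decode)
    print("prefill:", prefill)
```

Under these assumptions, the small decode batch lands on the memory-bound side of the ridge point and the long prefill on the compute-bound side. A scheduler that knows which side a batch falls on can judge whether adding more work to it is likely to raise latency.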