OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving

Siyu Wu; Zihan Tang; Yuting Zeng; Hui Chen; Guiguang Ding; Tongxuan Liu; Ke Zhang; Hailong Yang

TL;DR

Co-locating online and offline LLM workloads on Prefill/Decode (P/D) disaggregated systems creates P/D load imbalance under bursty online traffic. OOCO introduces a latency-constraint disaggregated architecture with latency-strict and latency-relaxed pools, a Roofline-guided, bottleneck-aware scheduler, and fast preemption and migration mechanisms that preserve online SLOs while boosting offline throughput. On real traces, the approach improves offline throughput by up to 3x while maintaining online TTFT (time to first token) and TPOT (time per output token). It is implemented on the xLLM engine with demonstrated portability and can be ported to other high-performance inference frameworks. The result is better resource utilization and cost efficiency for mixed online-offline LLM serving.

Abstract

Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, because fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services. We propose a latency-constraint disaggregated architecture, which separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-based scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests. Experiments on real-world traces show that, compared with existing offline serving approaches, our method improves offline throughput by up to 3x while maintaining online request SLOs.
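
The Roofline idea behind the bottleneck-based scheduler can be illustrated with a minimal sketch. This is not the authors' performance model: the `Hardware` class, the `roofline_time` function, and all numeric values below are hypothetical and chosen only for illustration. The sketch estimates a step's execution time as the larger of its compute time and its memory-transfer time, and reports which resource is the bottleneck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator description (values are assumptions)."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def roofline_time(flops: float, bytes_moved: float, hw: Hardware) -> tuple[float, str]:
    """Estimate step time with a basic Roofline model.

    The step is bounded by whichever is slower: executing its FLOPs at
    peak compute rate or moving its bytes at peak memory bandwidth.
    Returns (estimated seconds, "compute" or "memory").
    """
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.mem_bandwidth
    if compute_time >= memory_time:
        return compute_time, "compute"
    return memory_time, "memory"


# Illustrative numbers only: 100 TFLOP/s and 1 TB/s.
hw = Hardware(peak_flops=100e12, mem_bandwidth=1e12)

# A prefill-like step: many FLOPs per byte moved.
prefill_time, prefill_bound = roofline_time(flops=2e12, bytes_moved=1e10, hw=hw)

# A decode-like step: few FLOPs per byte moved.
decode_time, decode_bound = roofline_time(flops=1e10, bytes_moved=1e10, hw=hw)
```

With these assumed numbers, the prefill-like step is classified as compute-bound and the decode-like step as memory-bound. A scheduler could use such a classification to pair work that stresses different resources, although how OOCO actually does this is not specified in the abstract.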
