Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

Zizhao Mo; Junlin Chen; Huanle Xu; Chengzhong Xu

Abstract

Service providers often deploy multiple types of LLM services within shared clusters. While colocating services improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically reserve headroom to constrain BE resource usage. However, the coarse granularity of this approach compromises the SLO compliance of the LS service and unnecessarily restricts the generation potential of the BE service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also enables asynchronous communication between CPU and GPU streams, so that GPUs are not blocked while Attention results are aggregated. Additionally, OmniServe incorporates a dynamic batching control policy that adapts to fluctuating request arrivals and supports Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to $1.48\times$ while enhancing BE serving throughput by up to $9.85\times$ compared to state-of-the-art systems.

Paper Structure (47 sections, 6 equations, 27 figures, 2 tables, 1 algorithm)

Introduction
Background and Motivation
LLM Serving Basics
LLM inference workflow
LLM inference optimization
LLM Inference Service Requirements
Existing Systems for Hybrid LLM Serving Loads
Interference between LS and BE Services
Latency interference on LS service
Serving capacity reduction on BE service
Hybrid Serving: Opportunities and Challenges
Opportunities
Challenges
OmniServe System
Overview of OmniServe
...and 32 more sections

Figures (27)

Figure 1: Illustration of the complete LLM inference workflow for token generation. All layers within the model must be executed sequentially and are composed of an identical set of computation modules.
Figure 2: Per-token latency
Figure 3: MLP latency
Figure 4: Attention latency
Figure 6: Memory hierarchy along with the bandwidth and capacity information as well as the prioritization on memory resource usage, when LS and BE services are colocated on the same device.
...and 22 more figures
