Efficient LLM Serving on Hybrid Real-time and Best-effort Requests

Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan

TL;DR

BROS addresses the challenge of co-serving real-time (RT) and best-effort (BE) LLM workloads on shared GPUs by introducing a dynamic, iteration-level scheduling policy and a bidirectional KV cache design that lets RT and BE requests share memory efficiently. Its two technical pillars, priority-based packing of RT/BE requests and a KV cache layout with block preemption and lazy checkpointing, together balance low RT latency against high BE throughput. The authors formulate the scheduling problem, develop cost models for iteration time, and implement a system that reduces RT latency by up to 74.20% and improves RT SLO attainment by up to 36.38x, with negligible BE throughput loss across multiple models and datasets. The result is better resource utilization and responsiveness in hybrid RT/BE deployments.

Abstract

Recent breakthroughs in Large Language Models (LLMs) have enabled various generative tasks on a single model. Real-world services powered by an LLM (e.g., OpenAI's ChatGPT [27]) often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems; referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., document batch processing [28]; referred to as best-effort or BE requests), placing complex hybrid inference workloads on the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate machines to each type of request, targeting either low inference latency or high serving throughput. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that collocates RT and BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS also designs a bidirectional KV cache management mechanism that allows RT requests to share KV memory with BE requests, removing the scheduling restrictions caused by insufficient KV memory and improving utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests. It significantly reduces the latency of RT requests (by up to 74.20%) and improves their fine-grained service level objective (SLO) attainment (by up to 36.38x), with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems such as vLLM and TGI.
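
To make the idea of RT and BE requests sharing one KV memory region more concrete, the following is a minimal sketch, not the BROS implementation. It assumes a layout in which RT blocks grow upward from one end of a block pool, BE blocks grow downward from the other end, and an RT allocation may reclaim (preempt) BE blocks when the free gap runs out. The class name `BidirectionalBlockPool`, its methods, and the preemption rule are all hypothetical and chosen only for illustration.

```python
class BidirectionalBlockPool:
    """Toy KV block pool (illustrative assumption, not the paper's design).

    RT blocks occupy indices [0, rt_top); BE blocks occupy
    indices [be_bottom, num_blocks). The gap between them is free.
    """

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self.rt_top = 0                # next free index for RT (grows upward)
        self.be_bottom = num_blocks    # lowest index owned by BE (grows downward)

    def free_blocks(self) -> int:
        """Number of blocks owned by neither RT nor BE."""
        return self.be_bottom - self.rt_top

    def alloc_be(self, n: int) -> bool:
        """Give BE n blocks from the free gap; BE never preempts RT."""
        if n > self.free_blocks():
            return False
        self.be_bottom -= n
        return True

    def alloc_rt(self, n: int) -> int:
        """Give RT n blocks and return how many BE blocks were preempted."""
        if self.rt_top + n > self.num_blocks:
            raise MemoryError("pool too small for this RT allocation")
        # Take from the free gap first; reclaim BE blocks only for the shortfall.
        preempted = max(0, n - self.free_blocks())
        self.be_bottom += preempted
        self.rt_top += n
        return preempted


pool = BidirectionalBlockPool(10)
pool.alloc_be(6)                 # BE now holds 6 blocks, 4 remain free
first = pool.alloc_rt(3)         # fits in the free gap, no preemption
second = pool.alloc_rt(3)        # only 1 block free, so 2 BE blocks are reclaimed
```

In this toy model a preempted BE block is simply dropped; a real system would have to save or recompute its contents, which is where a mechanism such as the lazy checkpointing mentioned in the TL;DR would come in.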

Paper Structure

This paper contains 24 sections, 8 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: A powerful LLM can serve real-time interactive services while also handling vast non-real-time back-of-house tasks.
  • Figure 2: The two phases of an LLM inference request.
  • Figure 3: Head-of-line blocking with FCFS and the benefit of preemptive scheduling. The queueing proportion is the mean of every request's queueing time divided by its end-to-end latency.
  • Figure 4: Round-robin scheduling for hybrid RT/BE requests.
  • Figure 5: GPU memory block allocation for storing KV cache under different cases.
  • ...and 9 more figures
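
The queueing proportion defined in the Figure 3 caption (the mean, over requests, of queueing time divided by end-to-end latency) can be computed as in the short sketch below. The `RequestTiming` record and its field names are assumptions introduced for illustration; the source only gives the definition of the metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Hypothetical per-request timing record (field names are assumptions)."""
    queueing_time: float   # time spent waiting before being scheduled
    e2e_latency: float     # total end-to-end latency, must be positive


def queueing_proportion(requests: Sequence[RequestTiming]) -> float:
    """Mean of each request's queueing_time / e2e_latency."""
    if not requests:
        raise ValueError("need at least one request")
    ratios = [r.queueing_time / r.e2e_latency for r in requests]
    return sum(ratios) / len(ratios)


timings = [RequestTiming(1.0, 4.0), RequestTiming(3.0, 4.0)]
proportion = queueing_proportion(timings)   # (0.25 + 0.75) / 2 = 0.5
```

Note that this is a mean of per-request ratios, not total queueing time divided by total latency, so short requests that wait a long time weigh as heavily as long ones.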