AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
Haiying Shen, Tanmoy Sen
TL;DR
AccelGen tackles mixed-prompt LLM inference with heterogeneous iteration-level SLOs by coupling SLO-aware batching, dynamic long-prompt chunking, and joint optimization of GPU compute and KV-cache (KVC) resources. Its three components (SLO-guaranteed dynamic chunking, iteration-level SLO-based task prioritization, and multi-resource-aware batching) yield near-Oracle goodput and substantial gains in throughput and SLO attainment across large models and diverse datasets. It uses token-budget decisions and resource-aware scheduling to balance compute and KV-cache demands, achieving up to 11.21X higher throughput and 13.71X higher goodput than state-of-the-art baselines while reducing job completion time (JCT). The approach is relevant to real-world LLM serving, where latency, throughput, and user experience depend on managing both compute and memory resources under diverse SLOs.
Abstract
In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short and long prompts and heterogeneous SLOs for iteration time. To improve throughput when handling long prompts, previous research introduced a chunking method but did not address heterogeneous SLOs. To address this limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces three core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; and (3) multi-resource-aware batching, which selects queued requests to maximize the utilization of both GPU compute and the key-value cache (KVC). Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to state-of-the-art approaches. It achieves performance near the Oracle, which optimally maximizes goodput.
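To make the interplay of the three components concrete, the following is a minimal illustrative sketch, not AccelGen's actual algorithm. It assumes a hypothetical linear iteration-time model (a fixed base cost plus a per-token cost, both made-up constants), orders requests by SLO, derives a token budget from the tightest SLO, and sizes each prompt chunk so that neither the budget nor the free KVC space is exceeded. The names `Request`, `max_token_budget`, and `form_batch` are introduced here for illustration only.

```python
from dataclasses import dataclass

# Hypothetical cost model (assumption, not from the paper):
# iteration time = BASE_US + PER_TOKEN_US * tokens_in_batch, in microseconds.
BASE_US = 5_000
PER_TOKEN_US = 50


@dataclass
class Request:
    rid: str
    slo_us: int         # iteration-level SLO in microseconds
    prompt_tokens: int  # prompt tokens still waiting to be prefilled


def max_token_budget(slo_us: int) -> int:
    """Largest token count whose predicted iteration time fits within the SLO."""
    if slo_us <= BASE_US:
        return 0
    return (slo_us - BASE_US) // PER_TOKEN_US


def form_batch(queue: list[Request], kvc_free_tokens: int) -> list[tuple[str, int]]:
    """Return (request id, chunk size) pairs for one iteration."""
    # Prioritization: tight-SLO requests come first.
    ordered = sorted(queue, key=lambda r: r.slo_us)
    if not ordered:
        return []
    # The tightest SLO in the batch bounds the iteration time for everyone.
    budget = max_token_budget(ordered[0].slo_us)
    batch = []
    for req in ordered:
        # Dynamic chunking: take as much of the prompt as both the compute
        # budget and the free KV-cache space allow.
        chunk = min(req.prompt_tokens, budget, kvc_free_tokens)
        if chunk <= 0:
            break  # one of the two resources is exhausted
        batch.append((req.rid, chunk))
        budget -= chunk
        kvc_free_tokens -= chunk
    return batch


queue = [
    Request("doc", slo_us=100_000, prompt_tokens=2_000),  # long prompt, loose SLO
    Request("chat", slo_us=25_000, prompt_tokens=100),    # short prompt, tight SLO
]
print(form_batch(queue, kvc_free_tokens=1_000))  # [('chat', 100), ('doc', 300)]
```

In this toy run the tight 25 ms SLO yields a 400-token budget, so the short prompt is admitted whole and the long prompt is cut to a 300-token chunk that fills the remaining budget. A real system would replace the linear model with profiled iteration times and would also account for decode-phase tokens already in the batch.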
