Learned Best-Effort LLM Serving

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

TL;DR

The paper tackles latency guarantees for LLM serving without over-provisioning hardware by introducing a learned best-effort serving framework that uses a dynamic router trained with deep reinforcement learning to dispatch requests across multiple model sizes. It formulates the problem as a latency-quality optimization with hard and soft deadlines and solves it via a Deep Q-network policy that conditions on task, per-model batch sizes, and arrival rate. Empirical results on stable and unpredictable workloads show that the learned router outperforms static baselines in availability and peak performance, while delivering significantly higher hardware utility. The approach offers flexibility through reward design and demonstrates robustness to shifts in task distributions, making it applicable to a wide range of latency-sensitive applications. Overall, learned best-effort serving provides a cost-efficient, adaptable paradigm for LLM inference in real-world, dynamic environments.

Abstract

Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system can maintain availability with over 10x higher client request rates, serves above 96% of peak performance 4.1x more often, and serves above 98% of peak performance 2.3x more often than static serving on unpredictable workloads. Our learned router is robust to shifts in both the arrival and task distribution. Compared to static serving, learned best-effort serving allows for cost-efficient serving through increased hardware utility. Additionally, we argue that learned best-effort LLM serving is applicable in a wide variety of settings and gives application developers great flexibility to meet their specific needs.
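
To make the routing idea concrete, the following is a minimal sketch of how a router could map the state described in the TL;DR (task, per-model batch sizes, arrival rate) to a model choice. It is an illustration under stated assumptions, not the authors' implementation: a linear Q-function stands in for the Deep Q-network, and the names `RouterState`, `q_values`, `route`, and `hard_deadline_reward`, as well as the exact reward form, are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RouterState:
    """Hypothetical router state: task, per-model batch sizes, arrival rate."""
    task_id: int                 # index of the task the request belongs to
    batch_sizes: Sequence[int]   # current batch size at each model
    arrival_rate: float          # recent request arrival rate

    def features(self, num_tasks: int) -> List[float]:
        # One-hot task encoding, followed by batch sizes and the arrival rate.
        one_hot = [1.0 if i == self.task_id else 0.0 for i in range(num_tasks)]
        return one_hot + [float(b) for b in self.batch_sizes] + [self.arrival_rate]


def q_values(weights: Sequence[Sequence[float]], feats: Sequence[float]) -> List[float]:
    # Assumption: a linear stand-in for the Q-network, one weight row per model.
    return [sum(w * x for w, x in zip(row, feats)) for row in weights]


def route(weights, feats, epsilon: float, rng: random.Random) -> int:
    # Epsilon-greedy action selection: explore with probability epsilon,
    # otherwise pick the model with the highest estimated Q-value.
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    q = q_values(weights, feats)
    return max(range(len(q)), key=q.__getitem__)


def hard_deadline_reward(quality: float, latency: float, deadline: float) -> float:
    # Assumed reward shape for a hard deadline: a late response earns nothing.
    return quality if latency <= deadline else 0.0


# Usage: two tasks, three models, hand-picked illustrative weights.
state = RouterState(task_id=1, batch_sizes=[4, 0, 2], arrival_rate=10.0)
weights = [
    [0.0, 0.0, -1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, -1.0, 0.05],
]
chosen_model = route(weights, state.features(num_tasks=2), epsilon=0.0, rng=random.Random(0))
```

With `epsilon=0.0` the router is purely greedy; during training, a nonzero `epsilon` would let it explore other models. A soft-deadline variant would replace the zero in `hard_deadline_reward` with a penalty that grows with lateness, which is one place where reward design could offer the flexibility the abstract mentions.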

Paper Structure (22 sections, 1 equation, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Learned best-effort serving consists of multiple models serving multiple tasks, with a router that keeps track of system load and task information in order to route requests to models. In this example, each model is replicated on 4 GPUs, but any model partitioning and replication scheme may be used. Additionally, any number of models and tasks may be used.
  • Figure 2: The left figure shows the performance with hard deadlines. The right figure shows the distribution of model selection from the policy.
  • Figure 3: Model selection frequency for each individual task with hard deadlines.
  • Figure 4: The left figure shows the performance with soft deadlines. The right figure shows the distribution of model selection from the policy.
  • Figure 5: Model selection frequency for each individual task with soft deadlines.
  • ...and 9 more figures