
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing

Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan

TL;DR

This work tackles high latency in LLM inference by recognizing that the prefill (prompt) and decode phases have distinct compute and memory profiles, and that workload mixing must be accounted for when requests are spread across multiple homogeneous LLM instances. It proposes a workload-aware intelligent router that combines a DistilBERT-based output-length predictor with a latency impact estimator, and frames routing as a heuristic-guided reinforcement learning problem over a discrete-time Markov decision process $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$ that assigns each request to a model instance. Key contributions include (1) a latency impact model for mixing requests, (2) a lightweight decode-length predictor, (3) a heuristic-guided RL routing framework, and (4) a benchmarking-style evaluation showing end-to-end latency improvements (e.g., $11$–$19\%$ on synthetic datasets and $7.8\%$ on real production traces) and generalization across hardware/model configurations. The approach advances practical LLM serving by enabling data-driven, workload-aware load balancing and offers a potential standard for evaluating inference schedulers. It also remains effective when instance-level optimizations such as prefill chunking are enabled, and can guide future routing benchmarks in production environments.

Abstract

Large Language Model (LLM) workloads have distinct prefill and decode phases with different compute and memory requirements, which should ideally be accounted for when scheduling input queries across different LLM instances in a cluster. However, existing scheduling algorithms treat LLM workloads as monolithic jobs without considering the distinct characteristics of the two phases in each workload. This leads to sub-optimal scheduling and increased response latency. In this work, we start by characterizing factors affecting the response latency during LLM inference serving. We establish that better load balancing of inference requests across the available LLM instances can improve the end-to-end latency to a larger extent than merely focusing on optimizing the instance-level scheduler. Motivated by our findings, we propose a heuristic-guided reinforcement learning-based intelligent router for data-driven and workload-aware scheduling. Our router schedules queries across LLM instances by leveraging a trainable response-length predictor and a novel formulation for estimating the impact of mixing different workloads. It achieves over 11% lower end-to-end latency than existing approaches on a mix of public datasets and 7.8% lower end-to-end latency on real workload data with diverse input and output trends from Cloud Provider X. The proposed framework can also serve as a standard for benchmarking different LLM inference schedulers, since it provides the best latency for a given model, hardware, and instance-level scheduler combination.
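
To make the idea of workload-aware routing concrete, the sketch below shows a simple greedy router that uses a predicted output length to choose an instance. It is an illustration only, not the paper's heuristic-guided RL router: the `Instance` class, the `route` function, and the per-token cost constants are all hypothetical, and the latency estimate is a toy linear model rather than the paper's latency impact formulation.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One LLM instance and the token load currently assigned to it."""
    name: str
    prefill_tokens: int = 0  # prompt tokens still to be processed
    decode_tokens: int = 0   # predicted output tokens still to be generated


# Hypothetical per-token costs in milliseconds (assumed, not from the paper).
# Decode is priced higher per token because it runs one token at a time.
PREFILL_COST_MS = 0.05
DECODE_COST_MS = 1.0


def estimated_latency_ms(inst, prompt_len, predicted_output_len):
    """Toy estimate: work already on the instance plus the new request's work."""
    prefill = (inst.prefill_tokens + prompt_len) * PREFILL_COST_MS
    decode = (inst.decode_tokens + predicted_output_len) * DECODE_COST_MS
    return prefill + decode


def route(instances, prompt_len, predicted_output_len):
    """Greedily pick the instance with the lowest estimated latency."""
    best = min(
        instances,
        key=lambda inst: estimated_latency_ms(inst, prompt_len, predicted_output_len),
    )
    # Record the new request's load on the chosen instance.
    best.prefill_tokens += prompt_len
    best.decode_tokens += predicted_output_len
    return best
```

In this toy model, an instance holding many pending decode tokens is penalized more than one holding a long prompt, so a router that counted only requests or only total tokens could choose differently. A learned policy, as proposed in the paper, would replace the fixed greedy rule.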
