Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

Yueying Li, Jim Dai, Tianyi Peng

TL;DR

This work develops a queueing-theoretic framework for LLM inference that captures the dual-phase processing (prefill and decode) and batch formation to quantify throughput. It proves that a broad class of work-conserving scheduling algorithms can achieve the maximal throughput bound $b/t_b$ for a single LLM engine, with stability characterized by $\lambda(m_p+m_d) < b/t_b$, and demonstrates that multi-engine AI-agent networks require more sophisticated scheduling to maintain throughput, as illustrated by fork-join and RS networks. The paper shows that Orca and Sarathi-Serve are throughput-optimal in practice, while FasterTransformer and vanilla vLLM can be unstable under moderate load, providing practical guidance for system design. It also extends the analysis to AI-agent workloads, revealing scenarios where even work-conserving policies fail and highlighting the need for throughput-aware scheduling in distributed, collaborative inference systems. Overall, the results offer a formal foundation linking queueing theory to LLM-serving systems and advocate for interdisciplinary collaboration to optimize both throughput and latency in large-scale AI deployments.

Abstract

As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant effort has gone into system-level engineering, little has been explored from a mathematical modeling and queueing perspective. In this paper, we aim to develop the queueing fundamentals for LLM inference, bridging the gap between the queueing theory and LLM system communities. In particular, we study the throughput aspect of LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for an individual LLM inference engine, highlighting 'work-conserving' as a key design principle in practice. In a network of LLM agents, work-conserving scheduling alone is insufficient, particularly when facing specific workload structures and multi-class workflows that require more sophisticated scheduling strategies. Evaluations of real-world systems show that Orca and Sarathi-Serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits that the queueing community can offer in improving LLM inference systems and call for more interdisciplinary development.

Paper Structure

This paper contains 25 sections, 7 theorems, 65 equations, 9 figures.

Key Result

Theorem 1

Assume the i.i.d. assumption (eq:iid) and the first-moment assumption (eq:1m). Assume further the second-moment assumption $\mathbb{E}[D_1^2]<\infty$. (a) Assume the system load condition $\lambda(m_p+m_d) < b/t_b$. Then the DTMC $\{X_n, n\in \mathbb{N}\}$ is positive recurrent under any work-conserving $(K_p, K_d)$-FCFS algorithm. (b) Assume … Then …, where $\lvert X(n)\rvert=\sum_{i\in \mathcal{Q}_n}(P_i(n)+D_i(n))$ …
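The load condition $\lambda(m_p+m_d) < b/t_b$ quoted in the TL;DR can be checked with a few lines of arithmetic. The sketch below is illustrative only: the arrival rate and token counts are taken from the Figure 1 caption, while the token budget `b = 512` and the batch times `0.1` s and `0.2` s are hypothetical values chosen for the example, not measurements from the paper. The function names are likewise assumptions.

```python
def token_load(arrival_rate, mean_prefill, mean_decode):
    """Offered load in tokens per second: lambda * (m_p + m_d)."""
    return arrival_rate * (mean_prefill + mean_decode)


def max_throughput(token_budget, batch_time):
    """Throughput bound b / t_b in tokens per second."""
    return token_budget / batch_time


def is_stable(arrival_rate, mean_prefill, mean_decode, token_budget, batch_time):
    """True when the offered load is strictly below the bound b / t_b."""
    load = token_load(arrival_rate, mean_prefill, mean_decode)
    return load < max_throughput(token_budget, batch_time)


# Workload from the Figure 1 caption: 14.3 queries/s, 129 prefill and
# 112 decode tokens per request on average.
load = token_load(14.3, 129, 112)

# Hypothetical engine parameters (assumptions, not values from the paper).
stable_fast = is_stable(14.3, 129, 112, token_budget=512, batch_time=0.1)
stable_slow = is_stable(14.3, 129, 112, token_budget=512, batch_time=0.2)

print(f"offered load: {load:.1f} tokens/s")
print(f"b/t_b = 5120 tokens/s -> stable: {stable_fast}")
print(f"b/t_b = 2560 tokens/s -> stable: {stable_slow}")
```

Under these assumed parameters the same workload is sustainable at 5120 tokens/s but not at 2560 tokens/s. The condition only says whether a work-conserving scheduler can keep up; it says nothing about latency.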

Figures (9)

  • Figure 1: Experimental results demonstrating the instability of FasterTransformer and vLLM. The setup uses CodeLlama-34B on an A100; requests are homogeneous, with an average of 129 prefill and 112 decode tokens, and arrive at 14.3 queries per second.
  • Figure 2: Visualization of key scheduling terminology in an LLM engine. In this example, before Iteration 1, Request 1 has progressed to the decoding stage, while Request 2 and Request 3 have just arrived at the serving engine.
  • Figure 3: Batch processing time remains relatively constant for a given token budget (when the LLM is at full token load), and the CoV (coefficient of variation) becomes even smaller for larger models. The time is measured with SGLang under high load with various token budgets (maximum batch size and token numbers) and different request prefill/decode compositions driven by the ShareGPT dataset.
  • Figure 4: Piecewise linear fit of batch processing time for the CodeLlama-34B and Llama-70B models under various token budgets and tensor-parallel sizes. $R^2$ is above 0.985 in all cases.
  • Figure 5: Example workload and where the work-conserving criteria are broken. For vLLM, the second batch is not work-conserving because prefill is limited while decoding tokens from earlier requests are still waiting. For FasterTransformer, the second to fourth batches are not work-conserving because the prefills are blocked by earlier decodes.
  • ...and 4 more figures
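The Figure 5 caption describes batches that stop being work-conserving when prefill and decode tokens block each other. A minimal sketch of the idea follows, under a simplified reading that is an assumption here: a batch is treated as work-conserving if it uses the full token budget whenever enough work is pending. The `Request` class, `form_batch`, and `is_work_conserving` are hypothetical names for illustration and are not the paper's definitions or any serving system's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prefill_left: int  # prompt tokens not yet processed
    decode_left: int   # output tokens not yet generated


def pending_tokens(queue):
    """Tokens that could be scheduled in the next iteration."""
    total = 0
    for r in queue:
        if r.prefill_left > 0:
            total += r.prefill_left   # prefill can be chunked freely
        elif r.decode_left > 0:
            total += 1                # decode emits one token per iteration
    return total


def form_batch(queue, budget):
    """Mix decode tokens and chunked prefill; returns {rid: (prefill, decode)}."""
    batch, remaining = {}, budget
    # First, one decode token for every request that has finished prefill.
    for r in queue:
        if remaining == 0:
            break
        if r.prefill_left == 0 and r.decode_left > 0:
            batch[r.rid] = (0, 1)
            remaining -= 1
    # Then fill the leftover budget with prefill chunks in FCFS order.
    for r in queue:
        if remaining == 0:
            break
        if r.prefill_left > 0:
            chunk = min(r.prefill_left, remaining)
            batch[r.rid] = (chunk, 0)
            remaining -= chunk
    return batch


def is_work_conserving(batch, queue, budget):
    """Simplified check: the batch uses min(budget, pending work) tokens."""
    used = sum(p + d for p, d in batch.values())
    return used == min(budget, pending_tokens(queue))


queue = [Request(1, 0, 5), Request(2, 6, 3), Request(3, 4, 2)]
mixed = form_batch(queue, budget=8)
decode_only = {1: (0, 1)}  # prefills blocked behind an earlier decode

print(mixed, is_work_conserving(mixed, queue, 8))
print(decode_only, is_work_conserving(decode_only, queue, 8))
```

The mixed batch fills all 8 token slots, while the decode-only batch leaves 7 slots idle even though 10 prefill tokens are waiting, which is the kind of gap the caption attributes to FasterTransformer.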

Theorems & Definitions (22)

  • Theorem 1
  • proof : Proof Sketch.
  • Proposition 1
  • Conjecture 1
  • proof
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 2
  • ...and 12 more