Queueing, Predictions, and LLMs: Challenges and Open Problems

Michael Mitzenmacher, Rana Shahout

TL;DR

The paper asks how predictions can improve scheduling in queueing systems and, in particular, in Large Language Model (LLM) inference serving. It surveys prediction-based scheduling in classical queues (M/G/1 models, 1-bit predictions, and online-advice frameworks) and then analyzes LLM-specific scheduling challenges, including KV-cache memory management, preemption costs, and multi-stage processing. It introduces dynamic-batching, adaptive, and cost-aware policies, and surveys compound AI settings with augmented LLMs, multiple LLMs, and reasoning systems, highlighting when and how predictions can reduce latency and improve throughput. It identifies open problems, such as extending SOAP-based analyses to richer prediction models, accounting for prediction costs, designing split-phase and multi-GPU strategies, and developing theory-backed policies for complex AI pipelines, and argues that queueing theory has strong potential to guide practical LLM-serving systems.

Abstract

Queueing systems present many opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. This integration raises numerous open questions about how predictions can be effectively leveraged to improve scheduling decisions. Recent studies explore queues with predicted service times, typically aiming to minimize the time jobs spend in the system. We review these works, highlight the effectiveness of predictions, and present open questions on queue performance. We then consider an important practical example of using predictions in scheduling, namely Large Language Model (LLM) systems, which present novel scheduling challenges and highlight the potential for predictions to improve performance. In particular, we consider LLMs performing inference. Inference requests (jobs) in LLM systems are inherently complex; they have variable inference times, dynamic memory footprints that are constrained by key-value (KV) store memory limitations, and multiple possible preemption approaches that affect performance differently. We provide background on the important aspects of scheduling in LLM systems, and introduce new models and open problems that arise from them. We argue that there are significant opportunities for applying insights and analysis from queueing theory to scheduling in LLM systems.

Paper Structure

This paper contains 25 sections, 8 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Rank functions of size-estimate-based policies. Left: the rank function for SRPT, which decreases as $s-a$, where $s$ is the true size and $a$ is the age. Middle: the rank function for SPRPT, which decreases as $z-a$, where $z$ is the estimated size. Note that a job can have negative rank, at which point it cannot be preempted. Right: the SPRPT-with-bounce rank function from ScullyGM22, where the rank decreases from the estimate $z$ to 0 but then bounces back up, according to the function $\max(|z-a|,z)$. This bounce limits how long a long job that was predicted to be short can delay short jobs.
  • Figure 2: SkipPredict framework under the server cost model and external cost model.
  • Figure 3: DelayPredict framework under the server cost model and external cost model.
  • Figure 4: Transformer architecture.
  • Figure 5: A neural network layer is executed on hardware devices by transferring data from memory (e.g., HBM) to on-chip buffers, then computing with the on-chip processing units, and eventually sending the output data back to memory.
  • ...and 5 more figures