Don't Stop Me Now: Embedding Based Scheduling for LLMs

Rana Shahout; Eran Malach; Chunwei Liu; Weifan Jiang; Minlan Yu; Michael Mitzenmacher

TL;DR

The paper proposes a prediction-based SRPT variant with limited preemption that accounts for memory overhead in LLM systems. The variant allows preemption early in request execution, when memory consumption is low, and restricts it as requests approach completion to optimize resource utilization.

Abstract

Efficient scheduling is crucial for interactive Large Language Model (LLM) applications, where low request completion time directly impacts user engagement. Size-based scheduling algorithms like Shortest Remaining Processing Time (SRPT) aim to reduce average request completion time by leveraging known or estimated request sizes and allowing preemption by incoming jobs with shorter service times. However, two main challenges arise when applying size-based scheduling to LLM systems. First, accurately predicting output lengths from prompts is challenging and often resource-intensive, making it impractical for many systems. As a result, state-of-the-art LLM systems default to first-come, first-served scheduling, which can lead to head-of-line blocking and reduced system efficiency. Second, preemption introduces extra memory overhead to LLM systems, as they must maintain intermediate states for unfinished (preempted) requests. In this paper, we propose TRAIL, a method to obtain output length predictions from the target LLM itself. After generating each output token, we recycle embeddings from the LLM's internal layers as input for a lightweight classifier that predicts the remaining length for each running request. Using these predictions, we propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems. This variant allows preemption early in request execution, when memory consumption is low, but restricts preemption as requests approach completion to optimize resource utilization. On the theoretical side, we derive a closed-form formula for this SRPT variant in an M/G/1 queue model, which demonstrates its potential value. In our system, we implement this preemption policy alongside our embedding-based prediction method.
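The limited-preemption idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, purely for illustration, that a request becomes non-preemptable once its age (tokens generated so far) reaches a fraction `c` of its predicted size; the names `Request`, `is_preemptable`, and `pick_next` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    name: str
    predicted_size: float  # predicted output length, in tokens
    age: float = 0.0       # tokens generated so far

    @property
    def predicted_remaining(self) -> float:
        return max(self.predicted_size - self.age, 0.0)


def is_preemptable(req: Request, c: float) -> bool:
    # Assumption: the request is protected once age >= c * predicted size.
    return req.age < c * req.predicted_size


def pick_next(running: Optional[Request], waiting: List[Request], c: float) -> Optional[Request]:
    """Choose which request to serve at the next iteration."""
    if not waiting:
        return running
    # Candidate with the shortest predicted remaining length.
    best = min(waiting, key=lambda r: r.predicted_remaining)
    if running is None:
        return best
    # Preempt only while the running request is still early in its execution.
    if is_preemptable(running, c) and best.predicted_remaining < running.predicted_remaining:
        return best
    return running


if __name__ == "__main__":
    long_req = Request("long", predicted_size=100, age=85)
    short_req = Request("short", predicted_size=5)
    print(pick_next(long_req, [short_req], c=0.8).name)  # long is protected
    print(pick_next(long_req, [short_req], c=1.0).name)  # long can be preempted
```

With `c = 1` the sketch reduces to plain prediction-based SRPT, and smaller values of `c` protect requests earlier, which limits how much intermediate state has to be kept for preempted requests.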

Paper Structure (23 sections, 3 theorems, 33 equations, 8 figures, 1 table)

Introduction
Background and Motivation
Transformer-Based Generative Models and Key-Value Cache.
Iteration-Level Scheduling.
Method
Refined Output Length Prediction
Computing Output Length Prediction Per Iteration
Scheduling Policy
Evaluation
Predictions Accuracy
LLM Serving
Related Works
Limitations and Conclusion
Appendix
Bayesian Inference Transition matrix
...and 8 more sections

Key Result

Lemma 1

For SPRPT with limited preemption, where at age $a_0$ the jobs become non-preemptable, the expected mean response time for a job of true size $x$ and predicted size $r$ has a closed form expressed in terms of $\rho'_r = \lambda \int_{y=0}^{r} \int_{x_I=0}^{\infty} x_I \, g(x_I, y) \, dx_I \, dy$.
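As written, the double integral defining $\rho'_r$ equals $\lambda \, E[X \cdot 1\{Y \le r\}]$ when the pair (true size $X$, predicted size $Y$) has joint density $g$. The Python sketch below estimates it by Monte Carlo. The prediction model is an assumption chosen only so the result can be checked by hand: $X \sim \mathrm{Exp}(1)$ with perfect predictions $Y = X$, for which $\rho'_r = \lambda (1 - e^{-r}(1 + r))$. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def rho_prime(r: float, lam: float, samples: List[Tuple[float, float]]) -> float:
    """Monte Carlo estimate of lam * E[X * 1{Y <= r}] from (x, y) samples."""
    total = sum(x for x, y in samples if y <= r)
    return lam * total / len(samples)


def sample_perfect_predictions(n: int, rng: random.Random) -> List[Tuple[float, float]]:
    # Assumed model: true size X ~ Exp(1), predicted size Y = X.
    samples = []
    for _ in range(n):
        x = rng.expovariate(1.0)
        samples.append((x, x))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    samples = sample_perfect_predictions(100_000, rng)
    lam, r = 0.5, 1.0
    estimate = rho_prime(r, lam, samples)
    exact = lam * (1 - math.exp(-r) * (1 + r))  # closed form under the assumed model
    print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

Swapping in a different sampler for the `(x, y)` pairs gives the same quantity under other prediction models, without changing `rho_prime`.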

Figures (8)

Figure 1: TRAIL architecture. The system (1) initially orders requests using a BERT model, (2) schedules requests using a modified SPRPT with limited preemption, and (3) refines predictions during token generation using embeddings from the LLM's internal layers. At every iteration, steps 2 and 3 are repeated (represented as red dashed lines), which allows preemption at iteration-level granularity and refined predictions. We focus on identifying the LLM layer that best predicts output length rather than using multi-layer embeddings ($i=j=11$).
Figure 2: MAE for length prediction using embeddings vs. layer (1,000 prompts).
Figure 3: Mean Absolute Error for the predicted length, comparing BERT input embedding (dashed red), average token embedding without refinement (blue), and with refinement (orange), for different layers.
Figure 4: Log-scaled comparison of ground-truth vs predicted lengths bins. The $i$-th bin $b_i$ covers the range $\left[\frac{512i}{10}, \frac{512(i+1)}{10}\right)$.
Figure 5: Comparison of mean latency and TTFT across different values of $c$ ($c = 0.5, 0.8, 1$) at a request rate of 14.
...and 3 more figures
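The binning in the Figure 4 caption, where bin $b_i$ covers $\left[\frac{512i}{10}, \frac{512(i+1)}{10}\right)$, can be sketched in Python as follows. The constants 512 and 10 are read off the caption. The function names are hypothetical, and because the caption does not say how lengths of 512 or more are handled, this sketch simply rejects them.

```python
from typing import Tuple

MAX_LEN = 512   # upper end of the length range, from the caption
NUM_BINS = 10   # denominator in the caption's bin boundaries


def length_to_bin(length: int) -> int:
    """Return i such that 512*i/10 <= length < 512*(i+1)/10."""
    if not 0 <= length < MAX_LEN:
        raise ValueError(f"length must be in [0, {MAX_LEN}), got {length}")
    return (length * NUM_BINS) // MAX_LEN


def bin_range(i: int) -> Tuple[float, float]:
    """Return the half-open interval [lo, hi) covered by bin i."""
    if not 0 <= i < NUM_BINS:
        raise ValueError(f"bin index must be in [0, {NUM_BINS}), got {i}")
    return (MAX_LEN * i / NUM_BINS, MAX_LEN * (i + 1) / NUM_BINS)


if __name__ == "__main__":
    for n in (0, 51, 52, 300, 511):
        i = length_to_bin(n)
        print(n, i, bin_range(i))
```

Each bin is 51.2 tokens wide, so integer lengths 0 through 51 fall in bin 0 and 52 is the first length in bin 1.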

Theorems & Definitions (4)

Lemma 1
Theorem 1: Theorem 5.5 of Scully et al. (2018), SOAP
Lemma 1
proof
