Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

Abstract

To schedule LLM inference, the \textit{shortest job first} (SJF) principle is favorable because it prioritizes requests with short output lengths and thus avoids head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a \textit{point estimate} does not match the \textit{stochastic} decoding process of LLM inference, where output length is \textit{uncertain} by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. Through an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling; it adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.
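
The construction in the abstract can be made concrete with a small numerical sketch (Python, not the authors' code). It builds a log-t distribution from a predicted $\hat{\mu}$, $\hat{\sigma}$, and the fixed $\nu = 3.5$ mentioned in Figure 2, computes its expectation on a truncated support so that it is finite, computes CVaR as the conditional mean of the upper tail, and combines the two. The truncation bound l_max, the tail level alpha, and the additive rule mean + lam * cvar are illustrative assumptions; the paper's exact combination of $\mathbb{E}(\tilde{X})$ and CVaR is not reproduced here.

import numpy as np
from scipy import stats

def tie_score(mu, sigma, nu=3.5, alpha=0.9, lam=1.0, l_max=4096, n_grid=4096):
    """Sketch of a TIE-style score from a log-t output-length distribution.

    Output length is modeled as X = exp(mu + sigma * T) with T ~ Student-t(nu),
    truncated to [1, l_max] so that the expectation is finite.
    """
    x = np.linspace(1.0, l_max, n_grid)        # candidate output lengths (tokens)
    dx = x[1] - x[0]
    z = (np.log(x) - mu) / sigma
    pdf = stats.t.pdf(z, df=nu) / (sigma * x)  # change of variables: density of X at x
    pdf = pdf / (pdf.sum() * dx)               # renormalize after truncation

    mean = (x * pdf).sum() * dx                # E[X~]: expectation of the truncated log-t

    # CVaR_alpha: expected length conditional on falling in the worst (1 - alpha) tail.
    cdf = np.cumsum(pdf) * dx
    tail = cdf >= alpha
    tail_mass = max(pdf[tail].sum() * dx, 1e-12)
    cvar = (x[tail] * pdf[tail]).sum() * dx / tail_mass

    return mean + lam * cvar                   # assumed combination rule (illustrative)

# Requests would then be served in ascending order of this score (an SJF surrogate).
print(tie_score(mu=4.0, sigma=0.8))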

Paper Structure

This paper contains 39 sections, 2 theorems, 42 equations, 10 figures, and 8 tables.

Key Result

Theorem 3.2

Under Assumption \ref{asm:effective_rate}, the tail probability of the output length $L$ follows a power-law decay: $\Pr(L > \ell) \sim C\,\ell^{-\alpha}$ as $\ell \to \infty$, for some constants $C > 0$ and $\alpha > 0$.
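
As a sanity check on where such a tail can come from, the derivation below is illustrative only; it is not the paper's proof, and the Gamma prior is an assumption made purely for exposition. The point is that if each request stops with a per-token "effective rate" $\lambda$ that is itself random across requests, mixing over $\lambda$ turns light exponential (geometric-like) tails into a power law.

\begin{align*}
\Pr(L > \ell \mid \lambda) &= e^{-\lambda \ell}
    && \text{(continuous surrogate for a geometric EOS process)} \\
\lambda &\sim \mathrm{Gamma}(\alpha, \beta)
    && \text{(illustrative prior on the effective rate)} \\
\Pr(L > \ell) &= \int_0^{\infty} e^{-\lambda \ell}\,
    \frac{\beta^{\alpha}\lambda^{\alpha-1}e^{-\beta\lambda}}{\Gamma(\alpha)}\,d\lambda
    = \Bigl(\frac{\beta}{\beta+\ell}\Bigr)^{\alpha}
    \;\sim\; \beta^{\alpha}\,\ell^{-\alpha} \quad (\ell \to \infty).
\end{align*}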

Figures (10)

  • Figure 1: Output length distribution for the first prompt in the LMSYS-Chat-1M dataset (zheng2023lmsys). Bars show the output lengths over 256 generations, and the red curve is the fitted log-t distribution (see the fitting sketch after this list).
  • Figure 2: Overview of the scoring pipeline. The prompt is prepended with a CLS token and encoded by DeBERTa. A multi-pooling strategy aggregates the CLS token, mean-pooled, and max-pooled representations. Two separate prediction heads (MLPs) predict $\hat{\mu}$ and $\hat{\sigma}$, which, together with a fixed $\nu = 3.5$, are used to construct the log-t distribution. The final score is computed from $\mathbb{E}(\tilde{X})$ and CVaR.
  • Figure 3: Scheduling workflow for new requests. Solid arrows indicate the main workflow, while dashed arrows represent the asynchronous prediction workflow.
  • Figure 4: Performance of schedulers for online chatbot serving on the real-world workload (LMSYS-Chat-1M) with the 8B model.
  • Figure 5: Heatmaps of completion time versus output length under different strategies on the Alpaca dataset with the 8B model. TIE achieves higher concentration than SSJF and LTR, indicating its ability to accurately capture possible output lengths.
  • ...and 5 more figures
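
A minimal sketch (Python, not the authors' code) of the fit behind Figure 1: take repeated generations for one prompt, move to log space, and fit a Student-t. The synthetic data below stand in for real output lengths, and the fixed $\nu = 3.5$ variant mirrors the predictor described in Figure 2.

import numpy as np
from scipy import stats

# Synthetic stand-in for 256 sampled output lengths of one prompt
# (real data would come from repeated generations with the same prompt).
raw = np.exp(4.0 + 0.8 * stats.t.rvs(df=3.5, size=256, random_state=0))
lengths = np.clip(np.round(raw), 1, None)

log_lengths = np.log(lengths)

# Free fit of a Student-t to log-lengths: returns (nu, mu, sigma).
nu_hat, mu_hat, sigma_hat = stats.t.fit(log_lengths)

# Variant with the degrees of freedom pinned to 3.5 (f0 fixes the first shape
# parameter, here df), matching the fixed nu used by the predictor.
_, mu_fixed, sigma_fixed = stats.t.fit(log_lengths, f0=3.5)

print(nu_hat, mu_hat, sigma_hat, mu_fixed, sigma_fixed)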

Theorems & Definitions (5)

  • Theorem 3.2
  • Lemma 1.1: Tail Probability Expression
  • Proof
  • Proof of Theorem \ref{thm:heavy_tail_main}
  • Remark 1.3: Interpretation of Assumption \ref{asm:effective_rate} and Assumption \ref{asm:effective_rate_recall}