ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Youpeng Zhao, Jun Wang

TL;DR

This paper proposes ALISE, an efficient LLM inference serving framework. Its speculative scheduler estimates the execution time of each job and uses that estimate to assign job priorities, minimizing potential queuing delays for heterogeneous workloads.

Abstract

Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications require minimal response latency and maximal throughput for inference serving. However, because LLM execution time is unpredictable, the first-come-first-serve (FCFS) scheduling policy employed by current LLM serving systems suffers from head-of-line (HoL) blocking and long job response times. In this paper, we propose a new efficient LLM inference serving framework, named ALISE. The key design of ALISE is a novel speculative scheduler that estimates the execution time of each job and uses this prior knowledge to assign appropriate job priorities, thus minimizing potential queuing delays for heterogeneous workloads. Furthermore, to mitigate the memory overhead of the intermediate key-value (KV) cache, we employ a priority-based adaptive memory management protocol and quantization-based compression techniques. Evaluations demonstrate that, compared with the state-of-the-art solution vLLM, ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
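
To make the head-of-line blocking argument concrete, the following toy sketch compares FCFS with an ordering by predicted execution time. It is not ALISE's scheduler: it assumes a single server, no preemption, all jobs arriving at time zero, and a hypothetical predictor whose estimates are supplied by hand. The names `Job`, `predicted_time`, `fcfs_order`, and `speculative_order` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    predicted_time: float  # estimate from a hypothetical predictor (assumption)
    actual_time: float     # true execution time, unknown to the scheduler


def mean_response_time(jobs):
    """Average completion time when jobs run back to back in the given order."""
    clock = 0.0
    total = 0.0
    for job in jobs:
        clock += job.actual_time  # this job finishes at the current clock
        total += clock
    return total / len(jobs)


def fcfs_order(jobs):
    """First-come-first-serve: keep arrival order."""
    return list(jobs)


def speculative_order(jobs):
    """Run the job with the smallest predicted time first."""
    return sorted(jobs, key=lambda job: job.predicted_time)


# One long job arrives ahead of two short ones; predictions are imperfect.
jobs = [
    Job("long", predicted_time=9.0, actual_time=10.0),
    Job("short_a", predicted_time=1.5, actual_time=1.0),
    Job("short_b", predicted_time=2.5, actual_time=2.0),
]

fcfs_avg = mean_response_time(fcfs_order(jobs))
spec_avg = mean_response_time(speculative_order(jobs))
print(f"FCFS mean response time:        {fcfs_avg:.2f}")
print(f"Speculative mean response time: {spec_avg:.2f}")
```

Under FCFS both short jobs wait behind the long one, so every job's completion time includes the long job's full duration. Ordering by the estimate removes that wait even when the estimates are only approximately right, as long as they rank the jobs correctly.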

Paper Structure

This paper contains 12 sections, 7 equations, 9 figures, 3 tables, 2 algorithms.

Figures (9)

  • Figure 1: (a) Operations in the transformer layer. (b) An illustrative example of the autoregressive LLM inference process. (c) KV cache mechanism: at the prefilling stage, all input tokens are processed simultaneously, and the KV cache is initialized; at the decoding stage, the stored KV cache is retrieved for reuse and updated by iteration until termination.
  • Figure 2: End-to-end performance comparison of existing FCFS scheduling and speculative scheduling in ALISE on the ShareGPT dataset.
  • Figure 3: System overview of ALISE.
  • Figure 4: Retrieval-based Length Predictor Architecture.
  • Figure 5: Execution breakdown of OPT-13B. The left figure shows the prefill execution time for different input lengths ($s$), and the right figure shows the execution time for different decoding steps.
  • ...and 4 more figures