SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines

Ke Cheng, Zhi Wang, Wen Hu, Tiannuo Yang, Jianguo Li, Sheng Zhang

TL;DR

SCOOT is an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. It is universally applicable to various LLM inference engines, including vLLM and TensorRT-LLM.

Abstract

As large language models (LLMs) gain popularity across a wide range of web applications, optimizing service-level objectives (SLOs) for LLM inference services is of great importance for enhancing user satisfaction and improving the competitiveness of cloud vendors. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and that the optimal parameter configuration differs from service to service. Therefore, we propose SCOOT, an automatic performance tuning system that optimizes SLOs for each LLM inference service by tuning the parameters of the inference engine. SCOOT jointly exploits single-objective and multi-objective Bayesian optimization (BO) techniques to handle various optimization objectives via exploration and exploitation. Moreover, SCOOT prunes the search space with known constraints and adopts a random forest to learn hidden constraints during the tuning process, which mitigates invalid exploration. To improve tuning efficiency, SCOOT uses parallel suggestion to accelerate the tuning process. Extensive experiments demonstrate that SCOOT considerably outperforms existing tuning techniques in SLO optimization while greatly improving tuning efficiency. Moreover, SCOOT is universally applicable to various LLM inference engines, including vLLM and TensorRT-LLM. SCOOT has already been implemented in the production environment at Ant Group.

Paper Structure

This paper contains 39 sections, 10 equations, 16 figures, 4 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Optimal TTFT and TPOT for various services. The TTFT and TPOT shown are relative values compared to those of the default parameter configuration. The lower, the better.
  • Figure 2: TTFT and TPOT for applying optimal parameter configurations of different services to other services. Conf. $X$ represents the optimal parameter configuration for the service $X\in\{A,B,C,D,E\}$. The red lines (1.0×) in Sub-figures (a) and (b) indicate the TTFT and TPOT under the default parameter configuration of the inference engine, respectively.
  • Figure 3: SCOOT workflow. SCOOT leverages BO to find optimized parameter configurations via exploration and exploitation.
  • Figure 4: Illustration of two-dimensional HV. $f_1$ and $f_2$ are two objective functions and $\boldsymbol{r}$ is the reference point. (a) Blue area represents the HV of the existing solution set. (b) Yellow area depicts the HV improvement after adding $\boldsymbol{y}_4$.
  • Figure 5: Probability density function (PDF) of the request input and output lengths for four application request traces.
  • ...and 11 more figures
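
The two-dimensional hypervolume (HV) described in the Figure 4 caption can be illustrated with a small sketch. This is not the paper's implementation. It assumes both objectives $f_1$ and $f_2$ are minimized (consistent with "the lower, the better" for TTFT and TPOT) and that the reference point $\boldsymbol{r}$ is worse than every solution of interest. The function names `pareto_front`, `hypervolume_2d`, and `hv_improvement` and the sample points are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (f1, f2), both assumed to be minimized


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by f1 ascending."""
    front: List[Point] = []
    best_f2 = float("inf")
    for f1, f2 in sorted(set(points)):
        # A point survives only if it improves f2 over every point
        # with a smaller or equal f1.
        if f2 < best_f2:
            front.append((f1, f2))
            best_f2 = f2
    return front


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded by the reference point `ref`."""
    # Points that do not strictly beat the reference point add no area.
    inside = [(f1, f2) for f1, f2 in points if f1 < ref[0] and f2 < ref[1]]
    hv = 0.0
    prev_f2 = ref[1]
    # Sweep the front left to right, adding one horizontal strip per point.
    for f1, f2 in pareto_front(inside):
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv


def hv_improvement(points: Iterable[Point], candidate: Point, ref: Point) -> float:
    """HV gained by adding `candidate` to the existing solution set."""
    existing = list(points)
    return hypervolume_2d(existing + [candidate], ref) - hypervolume_2d(existing, ref)


if __name__ == "__main__":
    solutions = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]  # illustrative values only
    r = (4.0, 4.0)
    y4 = (1.5, 1.5)
    print(hypervolume_2d(solutions, r))      # 6.0
    print(hv_improvement(solutions, y4, r))  # 1.25
```

In this sketch, a candidate that is dominated by the existing set yields an improvement of zero, which is why HV improvement is a natural signal for choosing the next configuration to evaluate in multi-objective BO.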