Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction

Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang

TL;DR

This work tackles inefficient batch serving in language-model-as-a-service (LMaaS): because request generation lengths are unknown in advance, existing systems waste computation and memory. It introduces Magnus, a modular system that predicts per-request generation length from the user input length and semantic features (via LaBSE), then uses a WMA-directed adaptive batcher, a KNN-based serving-time estimator, and an HRRN scheduler to group and order batches. Empirical results on a multi-application, GPU-based testbed show that Magnus increases request throughput by up to 234% and reduces average and tail latency by up to 89.7%, with negligible overhead. The approach enables more efficient GPU utilization in LMaaS settings, offering practical QoS gains for providers and developers.

Abstract

Large language models (LLMs) are now published as a service and accessed by various applications via APIs, a model known as language-model-as-a-service (LMaaS). Without knowing the generation length of requests, existing serving systems serve requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two problems that hurt batch serving efficiency. First, the generation lengths of requests in a batch vary, so requests with short generation lengths must wait for requests with long generation lengths to finish during the batch serving procedure. Second, requests with longer generation lengths consume more memory during serving. Without knowing the generation lengths of batched requests, the batch size is always set small to avoid out-of-memory (OOM) errors, which prevents the GPU from being fully utilized. In this paper, we find that a significant number of popular applications in the LMaaS scenario show a positive correlation between the generation length and the length of the raw user input. Based on this observation, we propose Magnus, which can accurately predict the request generation length from the user input length together with application-level and user-level semantic features. Accordingly, Magnus achieves high request throughput by batching requests of similar generation lengths together with adaptive batch sizes. In addition, Magnus schedules batches with the highest response ratio next (HRRN) policy to reduce request response time. Experiments conducted on our testbed show that Magnus improves request throughput by up to 234% and reduces response time by up to 89.7% compared to baselines.
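The two ideas in the abstract, waiting caused by mixed generation lengths and HRRN batch ordering, can be illustrated with a minimal Python sketch. This is not the Magnus implementation: the `Batch` class, its fields, and both helper functions are hypothetical names introduced here. The response ratio uses the textbook HRRN definition, (waiting time + service time) / service time, and the waste count assumes a simple static batching model in which every request holds its slot until the longest request in the batch finishes.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """Hypothetical record for a batch waiting to be served."""
    name: str
    arrival_time: float      # seconds; when the batch became ready
    est_serving_time: float  # seconds; assumed to come from some estimator


def response_ratio(batch: Batch, now: float) -> float:
    """Textbook HRRN ratio: (waiting time + service time) / service time."""
    waiting = now - batch.arrival_time
    return (waiting + batch.est_serving_time) / batch.est_serving_time


def pick_next_hrrn(queue: list[Batch], now: float) -> Batch:
    """Return the queued batch with the highest response ratio."""
    return max(queue, key=lambda b: response_ratio(b, now))


def wasted_token_slots(generation_lengths: list[int]) -> int:
    """Count idle decoding steps in one batch under the static-batching
    assumption: each request waits until the longest one is done."""
    longest = max(generation_lengths)
    return sum(longest - n for n in generation_lengths)


if __name__ == "__main__":
    queue = [Batch("long", 0.0, 10.0), Batch("short", 5.0, 1.0)]
    print(pick_next_hrrn(queue, now=10.0).name)
    print(wasted_token_slots([10, 200, 50]))
    print(wasted_token_slots([190, 200, 195]))
```

With these inputs, the short batch is chosen first (ratio 6.0 versus 2.0) even though it arrived later, which is how HRRN favors short jobs without starving long ones. The mixed-length batch wastes 340 token slots, while the batch of similar lengths wastes only 15, which is the motivation for grouping requests by predicted generation length.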

Paper Structure

This paper contains 21 sections, 5 equations, 14 figures, 2 tables, and 1 algorithm.

Figures (14)

  • Figure 1: LMaaS scenario.
  • Figure 2: Apps with a strong positive correlation between the user input length and the request generation length.
  • Figure 3: LLM inference procedure. The blue and green rounded rectangles depict the computation flow of the transformer block and masked self-attention, respectively.
  • Figure 4: Key-value cache usage in two phases. The request has 3 tokens; the gray and orange grids represent the newly derived and reused key and value tensors, respectively.
  • Figure 5: Batch serving procedure for LLMs. The blue and green grids represent tokens of the request and valid tokens generated by the LLM, respectively.
  • ...and 9 more figures