BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng

TL;DR

BatchLLM tackles throughput bottlenecks in large-batch LLM inference by making global prefix sharing explicit: it replaces implicit LRU-based KV caching with a DP-based single-level prefix representation and organizes requests into prefix-sharing groups. It also introduces throughput-oriented token batching, a scheduling strategy that prioritizes decoding-heavy groups, and a horizontally fused prefix-shared Attention kernel that reduces the tail effect and kernel launch overhead. Empirical results across microbenchmarks and an industry workload show 1.3×–10.8× end-to-end speedups over vLLM and SGLang, attributed to better KV reuse and larger token batches. Together, these techniques raise GPU utilization for high-throughput LLM serving in batched information processing tasks.

Abstract

Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks. Many of these tasks are performed in large batches or even offline, and their performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and show limitations in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests. With this implicit cache management, KV context that is about to be reused may be prematurely evicted. Besides, streaming-oriented systems do not leverage the request-batch information and cannot mix decoding tokens with prefill chunks well in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. Requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM reorders the requests and schedules those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps increase GPU utilization. Finally, BatchLLM optimizes the prefix-shared Attention kernel with horizontal fusion to reduce the tail effect and kernel launch overhead. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by 1.3$\times$ to 10.8$\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments.
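
The grouping and reordering idea described in the abstract can be illustrated with a minimal sketch. This is not BatchLLM's implementation: the paper identifies prefixes with a DP-based procedure over a prefix tree, whereas the sketch below simply groups prompts by their first `prefix_len` tokens. The names `Request`, `est_decode_len`, `group_by_prefix`, `prefill_tokens`, `decode_ratio`, and `schedule` are hypothetical, and the output-length estimate is assumed to be known in advance.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    prompt: tuple        # token ids of the prompt
    est_decode_len: int  # assumed-known estimate of the output length


def group_by_prefix(requests, prefix_len):
    """Group requests whose first `prefix_len` tokens are identical.

    A fixed-length key is a stand-in assumption for real prefix detection.
    """
    groups = defaultdict(list)
    for req in requests:
        groups[req.prompt[:prefix_len]].append(req)
    return list(groups.values())


def prefill_tokens(group, prefix_len):
    """Prefill tokens for a group when the shared prefix is computed once."""
    shared = min(prefix_len, min(len(r.prompt) for r in group))
    return shared + sum(len(r.prompt) - shared for r in group)


def decode_ratio(group, prefix_len):
    """Fraction of the group's total tokens that are decoding tokens."""
    decode = sum(r.est_decode_len for r in group)
    return decode / (decode + prefill_tokens(group, prefix_len))


def schedule(requests, prefix_len):
    """Order prefix-sharing groups so decoding-heavy groups run first."""
    groups = group_by_prefix(requests, prefix_len)
    return sorted(groups, key=lambda g: decode_ratio(g, prefix_len), reverse=True)


reqs = [
    Request((9, 9, 9, 1), 1),   # prefill-heavy, no sharing partner
    Request((1, 2, 3, 4), 10),  # shares the prefix (1, 2, 3)
    Request((1, 2, 3, 5), 10),  # shares the prefix (1, 2, 3)
]
ordered = schedule(reqs, prefix_len=3)
```

In this toy input, the two requests sharing `(1, 2, 3)` form one group whose prefix is prefilled once, and that group is scheduled first because most of its tokens are decoding tokens.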

Paper Structure

This paper contains 29 sections, 4 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: The effect of the processing order of requests with chunked-prefill enabled. Given three requests with different prefill/decoding length characteristics, naive token batching in the arrival order of the requests mixes decoding tokens and prefill chunks poorly.
  • Figure 2: The number of tokens in the batch processed at each iteration for an industry task with vLLM's chunked-prefill. The curve shows "valleys" lasting many iterations.
  • Figure 3: BatchLLM overview.
  • Figure 4: The preprocessing to maximize first-level prefix reuse. It converts the initial prefix tree (a) into the final prefix tree (c). Each circle represents a node containing a number of tokens. The number in the circle is the number of tokens in the node. The procedure iterates bottom-up, recursively maximizing first-level reuse until it reaches the root node.
  • Figure 5: Microbenchmark evaluation. The setting m/n (like 2000/200) indicates the length of shared prefix/non-shared context, sd means sharing degree. The vLLM setting with '+ p' ('+ c') means prefix-caching (chunked-prefill) enabled.
  • ...and 5 more figures
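
The ordering effect described for Figure 1 and the "valleys" of Figure 2 can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not the paper's scheduler: each iteration has a fixed token budget, every running request contributes one decoding token per iteration, the remaining budget is filled with prefill chunks in processing order, KV memory limits are ignored, and the number of running requests is assumed to stay below the budget. The function name `simulate` and the example workloads are hypothetical.

```python
def simulate(requests, budget):
    """Return the number of tokens processed at each iteration.

    requests: list of (prefill_len, decode_len) tuples in processing order.
    budget:   maximum number of tokens per iteration (chunked-prefill budget).
    """
    pending = [list(r) for r in requests]  # requests whose prefill is unfinished
    decoding = []                          # remaining decode steps of running requests
    trace = []
    while pending or decoding:
        used = len(decoding)               # one token per decoding request
        decoding = [d - 1 for d in decoding if d > 1]
        finished_prefill = []
        # Fill the rest of the budget with prefill chunks, in order.
        while pending and used < budget:
            take = min(pending[0][0], budget - used)
            pending[0][0] -= take
            used += take
            if pending[0][0] == 0:
                finished_prefill.append(pending.pop(0)[1])
        # Requests that finished prefill start decoding in the next iteration.
        decoding.extend(d for d in finished_prefill if d > 0)
        trace.append(used)
    return trace


arrival_order = [(8, 1), (8, 1), (2, 12)]  # decoding-heavy request arrives last
decode_first = [(2, 12), (8, 1), (8, 1)]   # decoding-heavy request moved to the front

trace_arrival = simulate(arrival_order, budget=8)
trace_reordered = simulate(decode_first, budget=8)
```

Under this toy model both orders process the same 32 tokens, but the arrival order needs 15 iterations while the decoding-first order needs 13, because the long decode overlaps with the later prefill chunks instead of running alone in near-empty batches.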