BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng; Xin Ji; Taosong Fang; Fanghao Zhou; Chuanjie Liu; Gang Peng

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng

TL;DR

BatchLLM tackles throughput bottlenecks in large-batch LLM inference by making global prefix sharing explicit, replacing implicit LRU KV caching with a DP-based single-level prefix representation, and organizing requests into prefix-sharing groups. It introduces throughput-oriented token batching, scheduling strategies that prioritize decoding-heavy groups, and a horizontally fused prefix-shared Attention kernel to reduce tail latency. Empirical results across microbenchmarks and industry workloads show 1.3×–10.8× end-to-end speedups over vLLM and SGLang, with substantial gains in KV reuse and token batching efficiency. These innovations enable higher GPU utilization and scalable, high-throughput LLM serving in real-world information processing tasks.

Abstract

Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks. Many of these tasks are performed in large batches or even offline, and the performance indictor for which is throughput. These tasks usually show the characteristic of prefix sharing, where different prompt input can partially show the common prefix. However, the existing LLM inference engines tend to optimize the streaming requests and show limitations of supporting the large batched tasks with the prefix sharing characteristic. The existing solutions use the LRU-based cache to reuse the KV context of common prefix between requests. The KV context that are about to be reused may prematurely evicted with the implicit cache management. Besides, the streaming oriented systems do not leverage the request-batch information and can not mix the decoding tokens with the prefill chunks to the best for the batched scenarios, and thus fails to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies the common prefixes globally. The requests sharing the same prefix will be scheduled together to reuse the KV context the best. BatchLLM reorders the requests and schedules the requests with larger ratio of decoding first to better mix the decoding tokens with the latter prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase the GPU utilization. Finally, BatchLLM optimizes the prefix-shared Attention kernel with horizontal fusion to reduce tail effect and kernel launch overhead. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by 1.3$\times$ to 10.8$\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments.

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

TL;DR

Abstract

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (10)