Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

Bowen Pang, Kai Li, Feifan Wang

TL;DR

This work tackles the challenge of optimizing LLM inference throughput under memory and latency constraints by reframing batch size as a real-time control variable. It introduces two algorithms: a memory-aware dynamic batching method and an SLA-constrained variant, both formulated as online optimization problems and validated on vLLM with real prompts. Theoretical modeling links batch size, memory usage, and decoding latency to throughput, and empirical results show up to 28% throughput gains and 22% SLA-capacity gains, while preserving compatibility with existing infrastructure. The approach offers a practical framework for balancing memory, throughput, and QoS in modern LLM deployments, with avenues for future enhancements in MoE models and RLHF-integrated pipelines.

Abstract

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes significant limitations in static batching methods. Current inference serving systems often treat batch size as a fixed hyper-parameter, which prevents real-time adaptation to varying system conditions. In this paper, we propose a dynamic batching method that continuously monitors memory utilization and adheres to service-level agreements (SLAs) in order to adjust the batch size in real time. The method comprises two core components: a memory-aware batch scheduler that dynamically allocates GPU resources and a latency feedback mechanism that optimizes decoding processes under SLA constraints. Numerical experiments demonstrate throughput gains of 8% to 28% and capacity improvements of 22% compared to traditional static batching methods, while maintaining full compatibility with existing inference infrastructure. These results highlight the effectiveness of dynamic batching in balancing computational efficiency and quality-of-service requirements for contemporary LLM deployment scenarios. The source code of this work is publicly available at https://github.com/KevinLee1110/dynamic-batching.
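
As a rough illustration of what treating batch size as a feedback-controlled variable can look like, the sketch below adjusts the batch size from two observed signals: memory utilization and decoding latency. It is not the authors' algorithm. The additive-increase/multiplicative-decrease rule, the class name BatchController, the 0.90 memory cap, the starting batch size, and the reading of the SLA as a 50 ms decoding-latency target are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class BatchController:
    """Toy batch-size feedback controller (hypothetical, for illustration only)."""

    min_batch: int = 1
    max_batch: int = 256
    mem_limit: float = 0.90  # assumed cap on the fraction of GPU memory in use
    sla_ms: float = 50.0     # assumed decoding-latency target in milliseconds
    batch: int = 8           # assumed starting batch size

    def update(self, mem_util: float, decode_ms: float) -> int:
        """Return the batch size to use for the next scheduling step."""
        if mem_util > self.mem_limit or decode_ms > self.sla_ms:
            # A constraint is violated: back off multiplicatively.
            self.batch = max(self.min_batch, self.batch // 2)
        else:
            # Both constraints hold: probe for more throughput additively.
            self.batch = min(self.max_batch, self.batch + 1)
        return self.batch


if __name__ == "__main__":
    ctrl = BatchController()
    # Simulated (memory utilization, decoding latency in ms) observations.
    for mem, lat in [(0.50, 20.0), (0.70, 35.0), (0.95, 40.0), (0.60, 62.0)]:
        print(f"mem={mem:.2f} latency={lat:.0f}ms -> batch={ctrl.update(mem, lat)}")
```

In a real serving loop, the two inputs would come from the engine's own memory and latency measurements. The rule here only shows the control-loop shape: grow the batch while both constraints have headroom, and shrink it as soon as either the memory cap or the latency target is exceeded.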

Paper Structure

This paper contains 9 sections, 11 equations, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Dynamic batching as a real-time control problem
  • Figure 2: Dynamic batching according to memory use
  • Figure 3: Relationship among dynamic batch size, inference throughput, and decoding time
  • Figure 4: Capacity with SLA 50ms: dynamic vs. static batching