SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services

Yaodan Xu, Sheng Zhou, Zhisheng Niu

TL;DR

The paper addresses dynamic batching for online batch services with size-dependent service times. Under Poisson arrivals, it formulates the problem as an infinite-state SMDP that minimizes the weighted sum of average response time and average power consumption. To make the problem tractable, it introduces a finite-state approximation with an abstract cost for the "tail" states, discretizes the result into a DTMDP, and solves it with relative value iteration, so that near-optimal batching policies can be computed offline. Key contributions include a rigorous SMDP formulation for batch-service queues with size-dependent processing, substantial reductions in computational complexity from the tail abstraction (up to 63.5% in space and 98% in time), and extensive numerical results showing that SMDP-derived policies achieve better latency-energy tradeoffs and lower tail latency than benchmark batching schemes. The framework applies to ML inference serving and online computing, offers a flexible way to balance responsiveness and energy efficiency in batch-enabled servers, and can be extended to multi-processor systems and bursty traffic.

Abstract

For servers incorporating parallel computing resources, batching is a pivotal technique for providing efficient and economical services at scale. Parallel computing resources exhibit heightened computational and energy efficiency when operating with larger batch sizes. However, in the realm of online services, the adoption of a larger batch size may lead to longer response times. This paper aims to provide a dynamic batching scheme that delicately balances latency and efficiency. The system is modeled as a batch service queue with size-dependent service times. Then, the design of dynamic batching is formulated as a semi-Markov decision process (SMDP) problem, with the objective of minimizing the weighted sum of average response time and average power consumption. A method is proposed to derive an approximate optimal SMDP solution, representing the chosen dynamic batching policy. By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Numerical results showcase the superiority of SMDP-based batching policies across various parameter setups. Additionally, the proposed scheme exhibits noteworthy flexibility in balancing power consumption and latency.
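To make the relative value iteration step named in the TL;DR more concrete, the following is a minimal sketch on a toy discrete-time batching queue. It is not the paper's SMDP or its algorithm: the slotted Bernoulli arrivals, the one-slot service time, the energy model `2 + a`, the truncation at `n_max`, and every function and parameter name are assumptions made only for illustration. The state is the number of waiting requests and the action is the batch size to serve (0 means wait).

```python
def relative_value_iteration(n_states, actions, step_cost, transitions,
                             tol=1e-9, max_iter=100_000):
    """Average-cost relative value iteration on a finite MDP.

    Returns (gain, policy): the estimated average cost per step and a
    greedy action for each state. State 0 is the reference state.
    """
    h = [0.0] * n_states  # relative value function
    gain, policy = 0.0, [None] * n_states
    for _ in range(max_iter):
        q_best, policy = [], []
        for s in range(n_states):
            best_a, best_q = None, float("inf")
            for a in actions(s):
                q = step_cost(s, a) + sum(p * h[s2] for s2, p in transitions(s, a))
                if q < best_q - 1e-12:  # ties keep the smaller batch size
                    best_a, best_q = a, q
            q_best.append(best_q)
            policy.append(best_a)
        gain = q_best[0]                      # value at the reference state
        h_new = [q - gain for q in q_best]    # re-center so that h[0] == 0
        diff = max(abs(x - y) for x, y in zip(h_new, h))
        h = h_new
        if diff < tol:
            break
    return gain, policy


def toy_batching_mdp(n_max=4, b_max=2, p_arrival=0.3, w1=1.0, w2=0.0):
    """Hypothetical toy model: queue truncated at n_max, batches up to b_max."""
    def actions(s):
        return range(0, min(s, b_max) + 1)    # 0 = wait, a > 0 = serve a requests

    def step_cost(s, a):
        energy = (2.0 + a) if a > 0 else 0.0  # assumed fixed overhead + per-request term
        return w1 * s + w2 * energy           # holding cost stands in for latency

    def transitions(s, a):
        left = s - a
        return [(min(n_max, left), 1.0 - p_arrival),
                (min(n_max, left + 1), p_arrival)]

    return n_max + 1, actions, step_cost, transitions


if __name__ == "__main__":
    for w2 in (0.0, 1.0):
        gain, policy = relative_value_iteration(*toy_batching_mdp(w2=w2))
        print(f"w2={w2}: average cost per slot = {gain:.4f}, policy = {policy}")
```

With `w2 = 0` the energy term vanishes, so the toy policy serves as many requests as it can in every slot. Raising `w2` makes the fixed per-batch overhead matter, which is the kind of latency-energy tradeoff the weights in the paper's objective control.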

Paper Structure

This paper contains 27 sections, 6 theorems, 34 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

An average expected optimal stationary deterministic policy exists for the SMDP model $\mathcal{P}$.

Figures (11)

  • Figure 1: Batching of the same type of inference requests from potentially different users on a GPU-based ML-as-a-Service (MLaaS) platform.
  • Figure 2: Inference latency and energy consumption for batch processing GoogLeNet [szegedy2015going] on TESLA P4 and TESLA V100. The data, measured by NVIDIA, is based on an image classification task using images from the ImageNet12 dataset, which consists of 1000 classes with an image size of $224 \times 224$ [NVIDIA]. The batch size is plotted on a $\log_2$ scale.
  • Figure 3: The converged SMDP solutions under various parameter settings. The maximum batch size is chosen as $B_{\max}=8$. The weights are (a) $[w_1,w_2]=[1,0]$, (b) $[w_1,w_2]=[1,0.5]$, (c) $[w_1,w_2]=[1,1]$ and (d) $[w_1,w_2]=[1,100]$. The normalized traffic intensity $\rho$ varies in $\{0.1,0.3,0.5,0.7,0.9\}$. All the solutions exhibit a control limit structure, with the control limits highlighted by pink boxes.
  • Figure 4: Comparison of different policies on the average cost per unit time under $\rho=0.1, 0.3, 0.7$, with $w_1=1$ and $w_2$ ranging from $0$ to $15$.
  • Figure 5: The latency-energy tradeoff curves for different policies under various load conditions.
  • ...and 6 more figures

Theorems & Definitions (13)

  • Remark 1
  • Remark 2
  • Remark 3
  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Definition 3
  • Proposition 3
  • Proposition 4: Refer to Section 6 in [deb1973optimal]
  • ...and 3 more