AntBatchInfer: Elastic Batch Inference in the Kubernetes Cluster

Siyuan Li; Youshao Xiao; Fanzhuang Meng; Lin Ju; Lei Liang; Lin Wang; Jun Zhou

TL;DR

This work tackles the stability and efficiency challenges of offline batch inference in non-dedicated Kubernetes clusters. It proposes AntBatchInfer, a master–worker framework with four components (Stateful DDS, Data Handler, Elastic Controller, and Elastic Predictor Scheduler) that together provide elastic data distribution, fine-grained fault tolerance, and pipelined execution for both single-model and multi-model batch inference. Key contributions include a three-level fault-tolerance scheme (pod, application, data), a multi-stage pipeline whose stages run concurrently to hide IO and compute bottlenecks, and a DAG-based multi-model pipeline with per-model GPU assignment to improve throughput. In the experiments, throughput improves over the baseline by at least $2\times$ for single-model and $6\times$ for multi-model batch inference, and the system scales linearly up to large clusters. Deployment at Ant Group shows that the combination of elasticity, fault tolerance, and handling of heterogeneous models is practical for industry-scale batch inference workloads with complex pipelines.
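The data-level fault tolerance and elastic data distribution mentioned above can be illustrated with a small sketch. It is not the paper's implementation: the class `ShardQueue`, its methods, and the shard layout are hypothetical names chosen for illustration. The idea shown is that a master tracks shards as pending, leased, or committed, and returns a failed worker's shard to the pending queue so that another worker can pick it up.

```python
from collections import deque


class ShardQueue:
    """Hypothetical master-side bookkeeping for data shards (illustrative only)."""

    def __init__(self, num_samples, shard_size):
        # Split the sample range [0, num_samples) into half-open (start, end) shards.
        self.todo = deque(
            (start, min(start + shard_size, num_samples))
            for start in range(0, num_samples, shard_size)
        )
        self.doing = {}  # worker_id -> shard currently leased to that worker
        self.done = []   # shards whose results have been committed

    def lease(self, worker_id):
        """Return the worker's current shard, or hand it the next pending one."""
        if worker_id in self.doing:
            return self.doing[worker_id]
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def commit(self, worker_id):
        """Mark the worker's leased shard as finished."""
        self.done.append(self.doing.pop(worker_id))

    def worker_failed(self, worker_id):
        """Put a failed worker's shard back at the front of the pending queue."""
        shard = self.doing.pop(worker_id, None)
        if shard is not None:
            self.todo.appendleft(shard)

    def finished(self):
        return not self.todo and not self.doing


shards = ShardQueue(num_samples=10, shard_size=4)
shards.lease("w0")           # w0 takes the first shard
shards.lease("w1")           # w1 takes the second shard
shards.worker_failed("w1")   # w1 is lost; its shard becomes pending again
shards.commit("w0")
retried = shards.lease("w0")  # w0 picks up the shard that w1 dropped
```

Because workers pull shards instead of owning a fixed partition, workers can join or leave mid-job in this sketch without any shard being skipped or processed by two live workers at once.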

Abstract

Offline batch inference is a common task in industry for deep learning applications, but ensuring stability and performance is challenging when dealing with large amounts of data and complicated inference pipelines. This paper presents AntBatchInfer, an elastic batch inference framework specially optimized for non-dedicated clusters. AntBatchInfer addresses these challenges by providing multi-level fault-tolerance capabilities, enabling the stable execution of versatile and long-running inference tasks. It also improves inference efficiency through pipelining, intra-node scaling, and inter-node scaling, and it further optimizes performance in complicated multiple-model batch inference scenarios. Through extensive experiments and real-world statistics, we demonstrate the superiority of our framework in terms of stability and efficiency. In the experiments, it outperforms the baseline by at least $2\times$ in single-model batch inference and at least $6\times$ in multiple-model batch inference. It is also widely used at Ant Group, with thousands of daily jobs from various scenarios, including DLRM, CV, and NLP, which demonstrates its practicality in industry.
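The pipelining mentioned in the abstract can be sketched as stages that run concurrently and hand batches to each other through bounded queues, so that loading, prediction, and writing overlap instead of running one after another. The sketch below is an assumption-laden illustration, not the framework's code: `stage`, `run_pipeline`, and the three stage callables are hypothetical, and error handling is omitted.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(batches, preprocess, predict, postprocess, depth=4):
    """Run three stages concurrently; bounded queues provide back-pressure."""
    q_in, q_pre, q_pred, q_out = (queue.Queue(maxsize=depth) for _ in range(4))

    def feed():
        # Feed from a separate thread so the caller can drain q_out meanwhile.
        for batch in batches:
            q_in.put(batch)
        q_in.put(_STOP)

    threads = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(preprocess, q_in, q_pre)),
        threading.Thread(target=stage, args=(predict, q_pre, q_pred)),
        threading.Thread(target=stage, args=(postprocess, q_pred, q_out)),
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = q_out.get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


outputs = run_pipeline(
    batches=[[1, 2], [3, 4], [5]],
    preprocess=lambda b: [x * 2 for x in b],  # stand-in for data loading/decoding
    predict=lambda b: [x + 1 for x in b],     # stand-in for model inference
    postprocess=sum,                          # stand-in for writing results
)
```

Each stage here is a single thread, so batch order is preserved. In Python, threads overlap usefully only when stages release the interpreter lock (for example during IO or GPU calls); CPU-bound stages would need processes instead.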

Paper Structure (14 sections, 7 figures)

This paper contains 14 sections, 7 figures.

Introduction
Problem Analysis
Our Framework
Framework Architecture
Optimization for the Stability
Pod Fault Tolerance
Application Fault Tolerance
Data Fault Tolerance
Optimization for the Efficiency
Reducing the overall JCT
Speedup Single-model Batch Inference
Speedup Multiple-model Batch Inference Pipeline
Demonstration
Experiments

Figures (7)

Figure 1: The system overview of AntBatchInfer.
Figure 2: The multi-level fault tolerance.
Figure 3: The user interface of AntBatchInfer.
Figure 4: Throughput of the baseline and AntBatchInfer in single-model batch inference.
Figure 5: Throughput of the baseline and AntBatchInfer in multiple-model batch inference.
...and 2 more figures
