Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions

Amine Barrak, Emna Ksontini

TL;DR

The paper investigates scalable, cost-efficient ML inference by decomposing monolithic batch tasks into parallel serverless functions coordinated by an orchestrator, and it reports substantial latency reductions. In a DistilBERT-based sentiment analysis case study on the IMDb dataset, it compares monolithic and parallel execution and finds that parallel execution reduces execution time by over 95% at similar cost for larger batch sizes. The work highlights design considerations, including function orchestration limits, data access via EFS, and explicit cost models, to guide practitioners deploying large-scale inference in serverless environments. Overall, it provides concrete evidence that re-architecting batch ML inference for parallel serverless execution yields practical performance gains without a proportional increase in cost, and it outlines actionable trade-offs for real-world deployments.

Abstract

As data-intensive applications grow, batch processing in resource-limited environments faces scalability and resource management challenges. Serverless computing offers a flexible alternative, enabling dynamic resource allocation and automatic scaling. This paper explores how serverless architectures can make large-scale ML inference tasks faster and more cost-effective by decomposing monolithic processes into parallel functions. Through a case study on sentiment analysis using the DistilBERT model and the IMDb dataset, we demonstrate that serverless parallel processing can reduce execution time by over 95% compared to monolithic approaches, at the same cost.
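
The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: local threads stand in for serverless function invocations, and `toy_sentiment` is a hypothetical keyword rule standing in for DistilBERT inference. The helper names (`split_into_chunks`, `process_chunk`, `run_monolithic`, `run_parallel`) and the chunk size are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_chunks(items: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a batch into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]


def toy_sentiment(review: str) -> str:
    """Hypothetical stand-in for model inference (NOT DistilBERT)."""
    return "positive" if "good" in review.lower() else "negative"


def process_chunk(chunk: List[str]) -> List[str]:
    """Work done by one worker; plays the role of one function invocation."""
    return [toy_sentiment(review) for review in chunk]


def run_monolithic(reviews: Sequence[str]) -> List[str]:
    """Process the whole batch in a single call."""
    return process_chunk(list(reviews))


def run_parallel(reviews: Sequence[str], chunk_size: int, max_workers: int = 4) -> List[str]:
    """Fan the batch out over workers, then merge results in input order."""
    chunks = split_into_chunks(reviews, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the input chunks.
        per_chunk = list(pool.map(process_chunk, chunks))
    return [label for labels in per_chunk for label in labels]


if __name__ == "__main__":
    sample = ["Good movie", "Terrible plot", "so good", "boring", "GOOD fun"]
    print(run_parallel(sample, chunk_size=2))
```

Both paths return the same labels in the same order; only the scheduling of the work differs. In a real serverless deployment, each chunk would be handed to a separate function invocation by an orchestrator rather than to a local thread.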

Paper Structure

This paper contains 9 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Comparison of Monolithic and Parallel-Function Batch Processing Workflows in Serverless Computing
  • Figure : (a) Monolithic Processing
  • Figure : (a) Monolithic Processing
  • Figure : (b) Parallel Processing