
Comparison of Autoscaling Frameworks for Containerised Machine-Learning-Applications in a Local and Cloud Environment

Christian Schroeder, Rene Boehm, Alexander Lampe

TL;DR

This work addresses the challenge of autoscaling containerised ML inference across local and cloud environments. It systematically compares application-level scaling (RayServe, TorchServe) with container-level scaling (K3s), using a multi-model barcode/QR decoding workflow implemented in PyTorch. The study evaluates inference latency and upscaling response times under multi-client load, using metrics such as the moving mean $m_t$ and standard deviation $\sigma_t$, and analyzes resource usage, particularly for RayServe. Key findings show that local deployments with TorchServe and K3s achieve faster and more stable response times than cloud-based setups, where VM provisioning adds latency and slows scale-out; in the cloud, AWS ECS generally outperforms AWS EKS in both mean and variance of latency. The paper proposes deployment recommendations for local and cloud environments and discusses potential improvements in resource allocation and scaling strategies, including partitioning multi-model applications and exploring distributed GPU inference. These insights are relevant to configuring autoscaling for production-grade ML inference pipelines while balancing latency, availability, and cost.
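To make the two latency metrics concrete, the following is a minimal sketch of a moving mean $m_t$ and moving standard deviation $\sigma_t$ computed over a trailing window of inference times. The window length, the use of the population variance, the function name `moving_stats`, and the sample values are all assumptions made for illustration; the summary above does not state how the paper defines or parameterises these quantities.

```python
from collections import deque
from math import sqrt


def moving_stats(latencies, window=3):
    """Return a list of (m_t, sigma_t) pairs over a trailing window.

    Assumption: a simple trailing window with population variance.
    The paper's exact definition may differ.
    """
    buf = deque(maxlen=window)  # keeps only the most recent `window` samples
    out = []
    for x in latencies:
        buf.append(x)
        n = len(buf)
        mean = sum(buf) / n
        var = sum((v - mean) ** 2 for v in buf) / n
        out.append((mean, sqrt(var)))
    return out


# Hypothetical inference times in milliseconds, with a temporary spike
# such as one might see while new replicas are still starting up.
samples = [100.0, 100.0, 100.0, 400.0, 400.0, 100.0, 100.0]
stats = moving_stats(samples, window=3)

for t, (m_t, sigma_t) in enumerate(stats):
    print(f"t={t}  m_t={m_t:.1f} ms  sigma_t={sigma_t:.1f} ms")
```

In a sketch like this, a rise in $m_t$ indicates degraded latency under load, while a rise in $\sigma_t$ indicates unstable response times; both returning to baseline marks the point at which scale-out has taken effect.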

Abstract

When deploying machine learning (ML) applications, the automated allocation of computing resources, commonly referred to as autoscaling, is crucial for maintaining a consistent inference time under fluctuating workloads. The objective is to maximize the Quality of Service metrics, emphasizing performance and availability, while minimizing resource costs. In this paper, we compare scalable deployment techniques across three levels of scaling: at the application level (TorchServe, RayServe) and the container level (K3s) in a local environment (production server), as well as at the container and machine levels in a cloud environment (Amazon Web Services Elastic Container Service and Elastic Kubernetes Service). The comparison is conducted through the study of mean and standard deviation of inference time in a multi-client scenario, along with upscaling response times. Based on this analysis, we propose a deployment strategy for both local and cloud-based environments.

Paper Structure

This paper contains 10 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Frameworks and scaling level
  • Figure 2: AWS ECS architecture
  • Figure 3: Inference times at scaling test for different deployment methods