Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling

Yihong Jin; Ze Yang

TL;DR

The paper tackles the scalability of cloud-based AI inference services under dynamic workloads by introducing a hybrid framework that combines real-time load balancing based on multi-agent deep reinforcement learning (MADRL), a graph convolutional network (GCN) for state representation, and a GA-PSO (genetic algorithm and particle swarm optimization) autoscaling module. Decentralised agents coordinate resource allocation, using a graph-aware policy and a multi-objective optimizer to minimize latency and cost. Experiments on Google Cluster Data show roughly 35% higher load-balancing efficiency and roughly 28% lower latency than traditional baselines, which indicates practical value for real-time cloud inference. The work advances scalable cloud AI inference by integrating complementary ML techniques with decentralised control, and it may apply across cloud environments.

Abstract

The rapid expansion of AI inference services in the cloud calls for a robust scalability solution that can manage dynamic workloads while maintaining high performance. This study proposes a comprehensive scalability optimization framework for cloud AI inference services, focusing on real-time load balancing and autoscaling strategies. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution with deep neural networks for accurate demand forecasting. This multi-layered approach enables the system to anticipate workload fluctuations and proactively adjust resources, maximising resource utilisation and minimising latency. A decentralised decision-making process within the model further enhances fault tolerance and reduces response time in scaling operations. Experimental results demonstrate that the proposed model improves load-balancing efficiency by 35% and reduces response delay by 28% compared with conventional scalability solutions.
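To make the idea of forecast-driven, proactive scaling concrete, the following is a minimal sketch and not the authors' method. It swaps the paper's deep neural forecaster for a simple exponentially weighted moving average and the learned load-balancing policy for a greedy least-loaded rule. All function names, the per-replica capacity, and the headroom factor are hypothetical values chosen for illustration.

```python
import math


def ewma_forecast(history, alpha=0.5):
    """Forecast the next interval's request rate with an exponentially
    weighted moving average (a stand-in for a learned forecaster)."""
    forecast = history[0]
    for observed in history[1:]:
        forecast = alpha * observed + (1 - alpha) * forecast
    return forecast


def replicas_needed(forecast_rps, capacity_rps, headroom=1.2, min_replicas=1):
    """Replicas required to serve the forecast load with a safety margin.
    capacity_rps and headroom are assumed values, not from the paper."""
    return max(min_replicas, math.ceil(forecast_rps * headroom / capacity_rps))


def pick_replica(loads):
    """Route the next request to the least-loaded replica (greedy baseline,
    not the reinforcement-learning policy described in the paper)."""
    return min(range(len(loads)), key=loads.__getitem__)


# Hypothetical request rates (requests per second) over four intervals.
history = [100, 120, 160, 240]
forecast = ewma_forecast(history)
target = replicas_needed(forecast, capacity_rps=50)
```

The scaler acts on the forecast rather than on the last observed rate, which is what makes the adjustment proactive. A learned forecaster and policy would replace `ewma_forecast` and `pick_replica` while keeping the same control loop shape.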
