Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency
Chunyang Meng, Haogang Tong, Tianyang Wu, Maolin Pan, Yang Yu
TL;DR
BAScaler tackles the challenge of autoscaling under dynamic and bursty workloads to guarantee SLOs while reducing costs. It combines a prediction-based burst detector, an AR-bootstrapped burst overestimation, an SVR-based performance estimator for bursts, and a PPO-driven estimation enhancer to adapt resource provisioning in both bursty and non-bursty regimes. The system operates within a MAPE (Monitor-Analyze-Plan-Execute) framework and integrates with Kubernetes, Istio, and Prometheus to monitor, predict, and adjust resources before demand spikes. Experimental results on ten real-world traces show substantial reductions in SLO violations and request errors, along with lower resource costs, validating the effectiveness of the multi-level ML approach. The work highlights the practicality of fine-grained, burst-aware autoscaling for containerized cloud services and provides a public implementation for broader use.
Abstract
Autoscaling automatically adjusts the resources provisioned to applications without human intervention, guaranteeing runtime Quality of Service (QoS) while saving costs. However, user-facing cloud applications serve dynamic workloads that are often highly variable and contain bursts, which makes it hard for autoscaling to keep QoS within Service-Level Objectives (SLOs). Conservative strategies risk over-provisioning, while aggressive ones may cause SLO violations, so designing an effective autoscaler is difficult. This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads, which combines multi-level machine learning (ML) techniques to mitigate SLO violations while saving costs. BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts. When bursts are detected, BAScaler appropriately overestimates them and allocates resources accordingly to address the rapid growth in resource demand. During non-burst periods, BAScaler employs reinforcement learning to rectify potential inaccuracies in resource estimation, enabling more precise resource allocation. Experiments across ten real-world workloads demonstrate BAScaler's effectiveness, achieving a 57% average reduction in SLO violations and cutting resource costs by 10% compared to other prominent methods.
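The idea of prediction-based burst detection can be illustrated with a minimal sketch. This is not BAScaler's actual detector: the function name `detect_bursts`, the residual window, and the `k`-sigma threshold are all assumptions chosen for illustration. The sketch flags a time step as a burst only when the observed load exceeds the forecast by far more than recent forecast errors would suggest, so a periodic spike that the forecast already anticipates is not flagged.

```python
from statistics import mean, stdev

def detect_bursts(observed, predicted, window=5, k=3.0):
    """Flag steps where observed load exceeds the forecast by more than
    k standard deviations of recent forecast residuals.

    Hypothetical illustration only; window and k are assumed parameters.
    """
    flags = []
    residuals = []  # residuals from non-burst steps only
    for obs, pred in zip(observed, predicted):
        err = obs - pred
        recent = residuals[-window:]
        if len(recent) >= 2:
            # Threshold adapts to how accurate the forecast has been lately.
            threshold = mean(recent) + k * max(stdev(recent), 1e-9)
            is_burst = err > threshold
        else:
            # Not enough history to judge; assume no burst.
            is_burst = False
        flags.append(is_burst)
        if not is_burst:
            # Keep burst residuals out of the baseline so one burst
            # does not inflate the threshold for the next.
            residuals.append(err)
    return flags

# A periodic workload (100/200 requests per second) with one real burst at the end.
predicted = [100, 200, 100, 200, 100, 200, 100, 200]
observed = [102, 198, 101, 203, 99, 201, 100, 400]
flags = detect_bursts(observed, predicted)
```

In this example the regular peaks at 200 are matched by the forecast and stay unflagged, while the final value of 400 is flagged because it deviates far from what was predicted.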
