Ksurf-Drone: Attention Kalman Filter for Contextual Bandit Optimization in Cloud Resource Allocation
Michael Dang'ana, Yuqiu Zhang, Hans-Arno Jacobsen
TL;DR
Ksurf-Drone addresses cloud container orchestration under high workload and resource variability by fusing an attention-augmented Extended Kalman Filter (Ksurf) with Drone's contextual bandit optimization. The approach introduces KsurfNet to automatically parameterize the EKF noise terms and provides bounded regret with exponential convergence; it is evaluated on Google Cloud, Compute Canada, and VarBench benchmarks. Key findings include latency variance reductions of up to 41% at p95 and 47% at p99, plus a cost saving of about 7% in worker pod count, with modest scaler/master memory overhead. The results suggest that online Kalman-filter-based contextual bandits offer robust, scalable resource management in highly variable cloud environments, outperforming GP-based batch methods in short-horizon, real-time settings.
Abstract
Resource orchestration and configuration parameter search are key concerns for container-based infrastructure in cloud data centers. Large configuration search spaces and cloud uncertainties are often mitigated using contextual bandit techniques for resource orchestration, including the state-of-the-art Drone orchestrator. Complexity in the cloud provider environment due to varying numbers of virtual machines introduces variability in workloads and resource metrics, making orchestration decisions less accurate due to increased nonlinearity and noise. Ksurf, a state-of-the-art variance-minimizing estimator ideal for highly variable cloud data, enables optimal resource estimation under conditions of high cloud variability. This work evaluates the performance of Ksurf on estimation-based resource orchestration tasks involving highly variable workloads when employed as a contextual multi-armed bandit objective function model for cloud scenarios using Drone. Ksurf reduces latency variance by $41\%$ at p95 and $47\%$ at p99, and demonstrates a $4\%$ reduction in CPU usage and a 7 MB reduction in master node memory usage on Kubernetes, resulting in a $7\%$ cost saving in average worker pod count on the VarBench Kubernetes benchmark.
