
Ksurf-Drone: Attention Kalman Filter for Contextual Bandit Optimization in Cloud Resource Allocation

Michael Dang'ana, Yuqiu Zhang, Hans-Arno Jacobsen

TL;DR

Ksurf-Drone addresses cloud container orchestration under high workload and resource variability by fusing an attention-augmented Extended Kalman Filter (Ksurf) with Drone's contextual bandit optimization. The approach introduces KsurfNet to automatically parameterize EKF noise terms and provides bounded regret with exponential convergence, evaluated across Google Cloud, Compute Canada, and VarBench benchmarks. Key findings include up to 41% latency variance reduction at p95 and 47% at p99, plus ~7% worker pod count cost savings, with modest scaler/master memory overhead. The results suggest online Kalman-filter-based contextual bandits offer robust, scalable resource management in highly variable cloud environments, outperforming GP-based batch methods in short-horizon, real-time settings.

Abstract

Resource orchestration and configuration parameter search are key concerns for container-based infrastructure in cloud data centers. Large configuration search spaces and cloud uncertainties are often mitigated using contextual bandit techniques for resource orchestration, including the state-of-the-art Drone orchestrator. Complexity in the cloud provider environment due to varying numbers of virtual machines introduces variability in workloads and resource metrics, which increases nonlinearity and noise and makes orchestration decisions less accurate. Ksurf, a state-of-the-art variance-minimizing estimator suited to highly variable cloud data, enables optimal resource estimation under conditions of high cloud variability. This work evaluates the performance of Ksurf on estimation-based resource orchestration tasks involving highly variable workloads, where it serves as the objective function model of a contextual multi-armed bandit for cloud scenarios using Drone. Ksurf significantly reduces latency variance, by $41\%$ at p95 and $47\%$ at p99, and demonstrates a $4\%$ reduction in CPU usage and a 7 MB reduction in master node memory usage on Kubernetes, resulting in a $7\%$ cost saving in average worker pod count on the VarBench Kubernetes benchmark.

Paper Structure

This paper contains 21 sections, 2 theorems, 7 equations, 22 figures, 2 tables.

Key Result

Lemma 1

Figures (22)

  • Figure 1: Ksurf-Drone Optimization Architecture
  • Figure 2: KsurfNet Component Architecture
  • Figure 3: KsurfNet MAE comparison to EKF MAE
  • Figure 4: KsurfNet MAE by Epoch
  • Figure 5: Google Cloud Mean Request Latency by Threshold
  • ...and 17 more figures

Theorems & Definitions (2)

  • Lemma 1
  • Corollary 2