AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

Hamed Hamzeh

Abstract

Modern cloud-native applications require intelligent schedulers that balance system stability, resource utilisation, and cost. Kubernetes provides feasibility-based placement by default, and recent research has explored reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most use monolithic centralised agents, which do not scale to large heterogeneous clusters. Second, those that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that reacts adaptively to dynamic conditions. To address these gaps, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS), which introduces three major innovations. First, for scalability, we treat scheduling as a cooperative multi-agent problem in which every cluster node operates as an agent, using centralised training with decentralised execution. Second, to remain context-aware yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent, an improvement over methods that rely solely on local observations. Finally, to make trade-offs among the objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting. Evaluations on Google Kubernetes Engine (GKE) show that AGMARL-DKS significantly outperforms the default scheduler in fault tolerance, utilisation, and cost, especially when scheduling batch and mission-critical workloads.
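To illustrate the contrast between a static linear weighting and a stress-aware lexicographical ordering, here is a minimal sketch. It is not the paper's implementation. The three score names mirror the fault tolerance, utilisation, and cost objectives listed in the paper outline. Everything else is an assumption made for illustration: the `Scores` type, the stress threshold of 0.8, the two priority orders, the tie tolerance, and the fixed weights.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """Per-node objective scores in [0, 1]; higher is better (assumed convention)."""
    ft: float    # fault tolerance
    util: float  # resource utilisation
    cost: float  # cost efficiency


def priority_order(stress: float, threshold: float = 0.8) -> tuple[str, ...]:
    # Hypothetical rule: under high cluster stress, rank fault tolerance first;
    # otherwise rank cost efficiency first. The threshold is an assumed value.
    if stress >= threshold:
        return ("ft", "util", "cost")
    return ("cost", "util", "ft")


def lexicographic_key(s: Scores, order: tuple[str, ...], tol: float = 0.05) -> tuple[int, ...]:
    # Bucket each score by `tol` so that near-ties on a higher-priority
    # objective fall through to the next objective in the order.
    return tuple(round(getattr(s, name) / tol) for name in order)


def pick_node_lexicographic(candidates: dict[str, Scores], stress: float) -> str:
    # Tuples compare element by element, which gives the lexicographic ordering.
    order = priority_order(stress)
    return max(candidates, key=lambda n: lexicographic_key(candidates[n], order))


def pick_node_weighted(candidates: dict[str, Scores],
                       weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> str:
    # Static linear baseline: the same fixed weights regardless of stress.
    return max(candidates,
               key=lambda n: sum(w * v for w, v in zip(weights, candidates[n])))


candidates = {
    "node-a": Scores(ft=0.9, util=0.5, cost=0.2),
    "node-b": Scores(ft=0.4, util=0.6, cost=0.9),
}
calm_choice = pick_node_lexicographic(candidates, stress=0.1)
stressed_choice = pick_node_lexicographic(candidates, stress=0.9)
weighted_choice = pick_node_weighted(candidates)
```

In this toy setting the lexicographic rule changes its preferred node when the stress level crosses the assumed threshold, while the weighted baseline returns the same node at every stress level because its weights never change.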

Paper Structure (45 sections, 15 equations, 14 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Approach
Problem Formulation as a Multi-Agent System
State and Observations
Actions
Fault Tolerance ($\text{score}_{\text{FT}}$)
Resource Utilization ($\text{score}_{\text{UTIL}}$)
Cost Efficiency ($\text{score}_{\text{COST}}$)
Policy and State Transitions
Reward
System Architecture and Decentralised Execution
Graph Neural Network for Context-Aware Observations
Node Agent Actor-Critic Architecture
Actor Network
...and 30 more sections

Figures (14)

Figure 1: AGMARL-DKS development pipeline
Figure 2: The high-level architecture design for the AGMARL-DKS implementation in GKE
Figure 3: Pod distribution heatmap at the conclusion of Scenario 1. The default scheduler (left) exhibits a scattered placement. The AGMARL-DKS scheduler (right) demonstrates a consolidation strategy, grouping specific application types onto preferred nodes.
Figure 4: Total requested CPU cores on each node, stacked by application type. The default scheduler (left) creates a relatively even but undifferentiated load. The AGMARL-DKS scheduler (right) creates a more varied load profile, indicating a specialized packing strategy.
Figure 5: Stacked CPU utilization.
...and 9 more figures
