Scalable Reinforcement Learning for Virtual Machine Scheduling

Junjie Sheng, Jiehao Wu, Haochuan Cui, Yiqiu Hu, Wenli Zhou, Lei Zhu, Qian Peng, Wenhao Li, Xiangfeng Wang

TL;DR

This work tackles the scalability barrier of applying reinforcement learning to virtual machine scheduling (VMS) in large cloud clusters. It introduces Cluster Value Decomposition Reinforcement Learning (CVD-RL), which decomposes cluster value into per-PM components, uses a look-ahead mechanism to simplify state representations, and constrains exploration with a Top-k action filter. The framework achieves scalable learning that remains effective as the number of PMs grows (up to 50 PMs in Huawei Cloud data) and demonstrates strong generalization across different warm-starts, PM counts, and expansion scenarios, outperforming several state-of-the-art baselines in key metrics like scheduled length, CPU utilization, and income. By enabling near-optimal, scalable RL for VMS in large-scale cloud environments, CVD-RL offers a practical path toward deploying RL-based resource management in real-world data centers, with potential applicability to other multi-dimensional scheduling problems.

Abstract

Recent advancements in reinforcement learning (RL) have shown promise for optimizing virtual machine scheduling (VMS) in small-scale clusters. However, the application of RL to large-scale cloud computing scenarios remains notably constrained. This paper introduces a scalable RL framework, called Cluster Value Decomposition Reinforcement Learning (CVD-RL), to overcome the scalability hurdles inherent in large-scale VMS. The CVD-RL framework combines a decomposition operator with a look-ahead operator to manage representation complexity, complemented by a Top-$k$ filter operator that improves exploration efficiency. Unlike existing approaches, which are limited to clusters of $10$ or fewer physical machines (PMs), CVD-RL extends to environments with up to $50$ PMs. Furthermore, in empirical studies the CVD-RL framework demonstrates generalization capabilities that surpass contemporary state-of-the-art (SOTA) methods across a variety of scenarios. These results showcase the framework's scalability and performance and represent a significant step forward in the application of RL to VMS within complex, large-scale cloud infrastructures. The code is available at https://anonymous.4open.science/r/marl4sche-D0FE.

Paper Structure

This paper contains 29 sections, 19 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Top: The theoretical state-action spaces; Bottom: The converged performances. The motivating cases for our method (CVD-RL). The left figure depicts the state-action spaces under different numbers of physical machines (PMs). For SchedRL, these spaces expand exponentially with an increasing number of PMs, in contrast to CVD-RL, where they remain constant. The right figure demonstrates the converged performance of various methods across different PM counts. Notably, the performance disparity between SchedRL and CVD-RL widens as the number of PMs grows.
  • Figure 2: Overview of Virtual Machine Scheduling.
  • Figure 3: VMS dynamics. The process begins with the initial cluster status $\boldsymbol{s}^c(t)$ and the current allocation VM request $\boldsymbol{s}^v(t)$. The scheduling agent selects a scheduling action $\boldsymbol{a}(t)$ accordingly, causing the cluster to transition to $\hat{\boldsymbol{s}}^c(t)$ based on the action and the double-NUMA set $\mathbb{O}$. Afterward, the system handles release requests through a function $f$ and transitions to $\boldsymbol{s}^c(t+1)$, continuing to handle the next VM request.
  • Figure 4: CVD-RL's scalability is enhanced by two key components: Cluster Value Representation and Dynamic Action Space Construction. The lower left segment illustrates the decomposition and look-ahead operator, which expresses the cluster's value as the aggregate of individual physical machines' values. Meanwhile, the bottom right details how the top-$k$ filter operator dynamically constructs an action space, effectively reducing the action space from linear growth to a fixed size of $k$.
  • Figure 5: Runtime comparisons on Non-Expansion scenarios with increasing numbers of PMs, highlighting the trend of increased sampling time as the cluster size grows.
  • ...and 7 more figures
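
The two ideas named in the Figure 4 caption, a cluster value expressed as the aggregate of per-PM values and a top-$k$ filter that caps the action space at $k$ candidates, can be illustrated with a small sketch. This is not the authors' implementation. The `(free_cpu, free_mem)` PM state, the toy `pm_value` function (a stand-in for a learned per-PM value network), the best-fit ranking used inside `top_k_candidates`, and all function names are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

PM = Tuple[int, int]  # hypothetical PM state: (free_cpu, free_mem)
VM = Tuple[int, int]  # hypothetical VM request: (cpu, mem)


def pm_value(pm: PM) -> float:
    """Toy stand-in for a learned per-PM value function (assumption)."""
    free_cpu, free_mem = pm
    # Penalize unbalanced leftovers, which tend to strand resources.
    return -float(abs(free_cpu - free_mem))


def cluster_value(pms: List[PM]) -> float:
    """Decomposition: the cluster value is the sum of per-PM values."""
    return sum(pm_value(pm) for pm in pms)


def fits(pm: PM, vm: VM) -> bool:
    return pm[0] >= vm[0] and pm[1] >= vm[1]


def place(pm: PM, vm: VM) -> PM:
    """State of a single PM after hosting the VM."""
    return (pm[0] - vm[0], pm[1] - vm[1])


def top_k_candidates(pms: List[PM], vm: VM, k: int) -> List[int]:
    """Keep at most k feasible PMs, ranked by a best-fit heuristic (assumption)."""
    feasible = [i for i, pm in enumerate(pms) if fits(pm, vm)]
    feasible.sort(key=lambda i: sum(place(pms[i], vm)))  # least leftover first
    return feasible[:k]


def choose_pm(pms: List[PM], vm: VM, k: int) -> Optional[int]:
    """Pick the candidate whose placement most improves the cluster value."""
    candidates = top_k_candidates(pms, vm, k)
    if not candidates:
        return None
    # Look-ahead: only the chosen PM changes, so the change in cluster value
    # equals the change in that single PM's value.
    return max(
        candidates,
        key=lambda i: pm_value(place(pms[i], vm)) - pm_value(pms[i]),
    )


pms = [(8, 16), (4, 4), (16, 16), (2, 2)]
vm = (2, 4)
chosen = choose_pm(pms, vm, k=2)
```

In this sketch the value function is evaluated on at most $k$ PMs per request regardless of cluster size, while the filter itself is a cheap linear scan; the paper's actual filter criterion and value network may differ.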