Towards VM Rescheduling Optimization Through Deep Reinforcement Learning

Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du

TL;DR

The paper tackles VM rescheduling in data centers under stringent inference-time constraints by formulating an RL-based solution, VMR$^2$L, that combines a two-stage action decomposition, sparse attention for scalable relational state representations, and risk-seeking evaluation that lets users trade latency for solution quality. It demonstrates that VMR$^2$L can achieve a fragment rate (FR) close to that of a near-optimal mixed-integer programming (MIP) solution while delivering decisions in seconds, outperforming heuristic baselines and traditional optimization approaches in large-scale settings. The authors provide extensive evaluations across real datasets, multiple constraints, mixed objectives, and a range of generalization scenarios, and they release datasets and an RL gym environment to facilitate further research. The practical result is a scalable, adaptable VM rescheduling framework that reduces fragmentation in industrial data centers while staying within latency budgets, with direct applicability to production environments and potential use in broader system optimization.

Abstract

Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VMR$^2$L, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, as well as a risk-seeking evaluation enabling users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VMR$^2$L can achieve a performance comparable to the optimal solution but with a running time of seconds. Code and datasets are open-sourced: https://github.com/zhykoties/VMR2L_eurosys, https://drive.google.com/drive/folders/1PfRo1cVwuhH30XhsE2Np3xqJn2GpX5qy.
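
To make the two-stage idea in the abstract concrete, the following is a minimal sketch of one rescheduling step: stage 1 picks a VM to move, and stage 2 picks a destination PM from the PMs that can actually host it. Everything here is an illustrative assumption rather than the authors' implementation: the greedy scoring rules stand in for the learned policy, the names `select_vm`, `select_pm`, and `reschedule_step` are hypothetical, and the fragment measure (free cores modulo an assumed 16-core chunk) is a toy definition chosen for this example.

```python
from typing import Dict, List, Optional, Tuple

CHUNK = 16  # assumption: free cores left over below a multiple of CHUNK count as fragments

def fragments(free: List[int]) -> int:
    """Toy fragment measure: free cores on each PM that cannot fill a CHUNK-sized slot."""
    return sum(f % CHUNK for f in free)

def select_vm(vms: Dict[str, Tuple[int, int]], free: List[int]) -> str:
    """Stage 1 (greedy stand-in for a learned policy): pick the VM whose
    removal most reduces the fragment on its source PM."""
    def gain(name: str) -> int:
        cpu, src = vms[name]
        return free[src] % CHUNK - (free[src] + cpu) % CHUNK
    return max(vms, key=gain)

def select_pm(cpu: int, src: int, free: List[int]) -> Optional[int]:
    """Stage 2: mask out infeasible PMs, then pick the destination that
    leaves the smallest fragment after placement."""
    feasible = [p for p, f in enumerate(free) if p != src and f >= cpu]
    if not feasible:
        return None
    return min(feasible, key=lambda p: (free[p] - cpu) % CHUNK)

def reschedule_step(vms: Dict[str, Tuple[int, int]], free: List[int]) -> Optional[Tuple[str, int]]:
    """Run one two-stage step and apply the migration in place."""
    name = select_vm(vms, free)
    cpu, src = vms[name]
    dst = select_pm(cpu, src, free)
    if dst is None:
        return None
    free[src] += cpu
    free[dst] -= cpu
    vms[name] = (cpu, dst)
    return name, dst

# Usage: three PMs with 12, 4, and 20 free cores; each VM is (cores, source PM).
free = [12, 4, 20]
vms = {"a": (4, 0), "b": (12, 1)}
print(fragments(free), reschedule_step(vms, free), fragments(free))
```

Splitting the decision this way keeps each stage's choice set small (one VM, then one PM) instead of scoring every VM-PM pair jointly, and the feasibility mask in stage 2 is a natural place to encode constraints such as capacity.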

Paper Structure

This paper contains 30 sections, 5 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: The number of VM arrivals and exits per minute. The green line indicates a continuous VMS process over 24 hours.
  • Figure 2: VMS process. The green number 1 denotes the VMS operation, selecting PMs for incoming VM requests.
  • Figure 3: VMR process. The red number 2 highlights the off-peak period when VMR is typically performed.
  • Figure 4: FR and inference time at different MNLs.
  • Figure 5: Effect of inference time on achieved performance.
  • ...and 16 more figures