Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Francesco Lettich, Emanuele Carlini, Franco Maria Nardini, Raffaele Perego, Salvatore Trani

TL;DR

This work tackles online scheduling in GPU datacenters handling hybrid ML workloads with two objectives: minimizing GPU fragmentation and reducing power consumption. It introduces PWR, a power-aware scheduling policy implemented as a Kubernetes score plugin, and combines it with Fragmentation Gradient Descent (FGD) to balance power and fragmentation. Using real Alibaba traces and a simplified GPU-CPU power model in a simulated cluster, the authors report substantial power savings (often exceeding 10–20%) with PWR and PWR+FGD while maintaining high GPU utilization and acceptable fragmentation metrics. The results underscore the practical potential of integrating power-aware decisions into online GPU scheduling to improve energy efficiency in large-scale GPU datacenters.

Abstract

The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.
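
To make the idea of combining a power-aware score with a fragmentation-aware score concrete, the following is a minimal sketch of a weighted linear combination of two per-node scores. It is not the authors' implementation: the `Candidate` fields, the min-max normalization, the weight `alpha`, and the sample numbers are all assumptions made for illustration. The only outside fact relied on is that Kubernetes score plugins report per-node scores in the range 0 to 100.

```python
from dataclasses import dataclass

MAX_NODE_SCORE = 100  # Kubernetes score plugins report per-node scores in [0, 100]


@dataclass
class Candidate:
    """Hypothetical per-node estimates for placing one incoming task."""
    node: str
    power_increase_w: float  # assumed: estimated extra power draw, in watts
    frag_increase: float     # assumed: estimated growth of a fragmentation metric


def normalize_lower_is_better(values):
    """Map raw costs to scores in [0, 100]; the lowest cost gets 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All nodes are equivalent on this criterion.
        return [float(MAX_NODE_SCORE)] * len(values)
    return [MAX_NODE_SCORE * (hi - v) / (hi - lo) for v in values]


def combined_scores(candidates, alpha):
    """Weighted sum: alpha weights the power score, (1 - alpha) the fragmentation score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    power = normalize_lower_is_better([c.power_increase_w for c in candidates])
    frag = normalize_lower_is_better([c.frag_increase for c in candidates])
    return {
        c.node: alpha * p + (1.0 - alpha) * f
        for c, p, f in zip(candidates, power, frag)
    }


def pick_node(candidates, alpha):
    """Return the name of the node with the highest combined score."""
    scores = combined_scores(candidates, alpha)
    return max(scores, key=scores.get)


# Illustrative numbers only; they are not taken from the paper.
candidates = [
    Candidate("node-a", power_increase_w=300.0, frag_increase=0.10),
    Candidate("node-b", power_increase_w=150.0, frag_increase=0.40),
    Candidate("node-c", power_increase_w=200.0, frag_increase=0.20),
]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, pick_node(candidates, alpha))
```

With these sample values, a purely power-driven weight (`alpha = 1.0`) and a purely fragmentation-driven weight (`alpha = 0.0`) pick different nodes, while an intermediate weight picks a node that is reasonable on both criteria. This is the kind of trade-off a combined policy aims for.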

Paper Structure

This paper contains 23 sections, 8 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: FGD EOPC (in MW) for workloads from the Default trace, with stacked CPU and GPU components. The dashed line shows the fraction of GPU power (see right-hand y-axis).
  • Figure 2: Power savings (in percentage w.r.t. FGD, top plot) and GRAR scores (bottom plot) measured for PWR and its linear combinations with FGD, workloads from Default trace.
  • Figure 3: Power savings with workloads from the Default trace.
  • Figure 4: Power savings with sharing-GPU workloads -- case in which sharing-GPU tasks request 100% of GPU resources.
  • Figure 5: Power savings with multi-GPU workloads.
  • ...and 5 more figures