EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training

Kawsar Haghshenas, Mona Hashemi

TL;DR

EaCO addresses the energy inefficiency of Deep Learning Training (DLT) on shared GPU clusters by enabling GPU sharing under an energy-aware scheduler. It combines hardware-supported context switching with predictions drawn from experiments and historical data, plus early-stage observations of running jobs, to bound performance loss while reducing energy. The approach reduces total energy by up to 39% on large-scale production traces and saves up to 44% energy per job in co-location experiments, with a modest increase in Job Completion Time (JCT). For cloud providers and cluster operators, this means better energy efficiency without sacrificing SLOs, and a foundation for energy-aware co-allocation of DLT workloads.

Abstract

Deep Learning Training (DLT) is a growing workload in shared GPU/CPU clusters due to its high computational cost and increasing number of jobs. This contributes to significant energy consumption in GPU clusters, further exacerbated by GPU under-utilization, as shown in production cluster logs. Addressing this challenge requires workload scheduling and resource allocation policies that share GPUs efficiently, improving resource and energy efficiency while maintaining performance. However, previous works primarily optimize for performance, often overlooking or even sacrificing energy efficiency. In this paper, we present EaCO, the first energy-aware scheduling algorithm designed specifically for DLT workloads in GPU clusters. EaCO leverages hardware-supported context switching to enable GPU sharing across multiple DLT jobs, improving resource and energy utilization. GPU sharing can increase Job Completion Time (JCT) and may lead to contention if not employed carefully. To address this, EaCO integrates experiment-based and history-based predictions as well as early-stage observations, ensuring performance expectations are met while optimizing energy efficiency. We begin by experimentally exploring the dynamics of co-locating DLT jobs, investigating the impact of co-location on energy and resource utilization. Our results show that co-location improves energy efficiency by up to 44% for individual jobs and increases average GPU utilization to as high as 97%. Additionally, evaluations on large-scale clusters using production traces demonstrate that EaCO reduces total energy by up to 39% compared to existing algorithms, with a minimal increase in job runtime: less than 3.2% in our simulations.

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 4 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Total energy and average JCT for running a set of jobs, with and without space sharing.
  • Figure 2: Resource utilization (GPU, CPU, and memory) while running different job combinations.
  • Figure 3: Total energy and average job runtime for executing a job trace, normalized to the default FIFO algorithm, across two cluster sizes.
  • Figure 4: Number of active nodes under different algorithms for (a) 28-node and (b) 64-node cluster configurations.