Reducing Energy Bloat in Large Model Training

Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, Mosharaf Chowdhury

TL;DR

Perseus obtains the time-energy tradeoff frontier of a large model training job using an efficient graph cut-based algorithm, and schedules computation energy consumption across time to reduce both types of energy bloat.

Abstract

Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during training directly contributes to end-to-end throughput; a significant portion can be removed without slowing down training. We call this portion energy bloat. In this work, we identify two independent sources of energy bloat in large model training and propose Perseus, a training system that mitigates both. To do this, Perseus obtains the time-energy tradeoff frontier of a large model training job using an efficient graph cut-based algorithm, and schedules computation energy consumption across time to reduce both types of energy bloat. Evaluation on large models, including GPT-3 and Bloom, shows that Perseus reduces the energy consumption of large model training by up to 30% without any throughput loss or hardware modification.
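To make the notion of a time-energy tradeoff frontier concrete, the sketch below filters a set of candidate (iteration time, energy) points down to those that no other point beats on both axes. This is only the generic Pareto-frontier definition, not the graph cut-based algorithm the abstract refers to; the function name `pareto_frontier` and the sample numbers are hypothetical.

```python
def pareto_frontier(points):
    """Return the (time, energy) points not dominated by any other point.

    A point is dominated if some other point is at least as fast and uses
    strictly less energy. Illustrative only; not the Perseus algorithm.
    """
    frontier = []
    best_energy = float("inf")
    # Sort by time ascending (ties broken by energy ascending), then keep a
    # point only if it uses less energy than every faster point seen so far.
    for time, energy in sorted(points):
        if energy < best_energy:
            frontier.append((time, energy))
            best_energy = energy
    return frontier


# Hypothetical candidates: (iteration time in seconds, energy in joules).
candidates = [(1.0, 100.0), (1.1, 90.0), (1.2, 95.0), (1.3, 80.0), (1.4, 85.0)]
frontier = pareto_frontier(candidates)
```

Along the resulting frontier, a longer iteration time always buys lower energy, which is what makes it a tradeoff curve.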

Paper Structure

This paper contains 76 sections, 3 theorems, 12 equations, 13 figures, 11 tables, 3 algorithms.

Key Result

Theorem 1

Pipeline Energy Minimization is NP-hard.

Figures (13)

  • Figure 1: One training iteration of GPT-3 1.3B with 4 pipeline stages and 6 microbatches on NVIDIA A100 GPUs, drawn to scale. For example, F5 and B5 in the S2 row denote the forward and backward computations for the fifth microbatch on Stage 2. The critical path is traced with a blue line. Colors show power consumption. Other models are visualized in the paper's appendix on intrinsic energy bloat.
  • Figure 2: Of two data-parallel pipelines, the first becomes a straggler. The non-straggler pipeline causes extrinsic energy bloat by running as fast as possible (a), which can be reduced by precisely slowing it down (b).
  • Figure 3: Three cases showing where the straggler pipeline's iteration time $T'$ can be. $T_{\min}$ and $T^*$ are the shortest and longest iteration times on the time-energy frontier. The black dot is when all computations run at the maximum speed, which wastes energy. The green dot is the energy-optimal iteration time of the non-straggler pipeline. Solid orange dots make up the frontier, and the orange dotted line shows that iteration energy increases beyond $T^*$.
  • Figure 4: Perseus architecture and workflow.
  • Figure 5: Starting from the energy schedule that consumes the minimum energy, we iteratively reduce its iteration time, tracing up the tradeoff frontier one point at a time.
  • ...and 8 more figures
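
The three cases in the Figure 3 caption can be read as a clamping rule: a non-straggler pipeline gains nothing by finishing before the straggler's iteration time $T'$, cannot run faster than $T_{\min}$, and should not slow down past $T^*$ because energy increases beyond it. The sketch below encodes that reading. It is an interpretation of the caption, not the paper's implementation, and the function names and frontier values are hypothetical.

```python
def target_iteration_time(t_straggler, t_min, t_star):
    """Clamp the straggler's iteration time T' into [T_min, T*]."""
    return min(max(t_straggler, t_min), t_star)


def pick_schedule(frontier, t_straggler):
    """Pick the lowest-energy frontier point that does not exceed the target time.

    `frontier` is a non-empty list of (time, energy) pairs sorted by time
    ascending, so energy decreases along the list.
    """
    t_min, t_star = frontier[0][0], frontier[-1][0]
    target = target_iteration_time(t_straggler, t_min, t_star)
    # The slowest point within the target is also the cheapest in energy.
    feasible = [point for point in frontier if point[0] <= target]
    return feasible[-1]


# Hypothetical frontier: (iteration time in seconds, energy in joules).
frontier = [(1.0, 100.0), (1.1, 90.0), (1.3, 80.0)]

fast_straggler = pick_schedule(frontier, 0.9)  # T' <= T_min
mid_straggler = pick_schedule(frontier, 1.2)   # T_min < T' < T*
slow_straggler = pick_schedule(frontier, 2.0)  # T' >= T*
```

With a discrete frontier, the sketch rounds down to the nearest frontier point so the non-straggler never becomes slower than the straggler itself.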

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3