Carbon Emissions and Large Neural Network Training

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, Jeff Dean

TL;DR

The energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) are calculated, and earlier estimates for the neural architecture search that found Evolved Transformer are refined to avoid miscalculations.

Abstract

The computation demand for machine learning (ML) has grown rapidly in recent years, and that growth comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e):

  • Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy, despite using as many or even more parameters.
  • Geographic location matters for ML workload scheduling, since the fraction of carbon-free energy and the resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained.
  • Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems.

Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.
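As a rough sanity check on the combined ~100-1000X figure, the individual factors quoted in the abstract can be multiplied together. The sketch below rests on two assumptions that the abstract does not state: that the factors are independent and combine multiplicatively, and that "<1/10th the energy" can be treated as a flat 10X reduction. The dictionary keys and the `combined_range` helper are illustrative names, not taken from the paper.

```python
from math import prod

# (low, high) reduction factors as quoted in the abstract.
# Assumption: "<1/10th the energy" is treated as a flat 10X for sparse DNNs.
FACTORS = {
    "sparse_vs_dense_dnn": (10.0, 10.0),
    "datacenter_location": (5.0, 10.0),
    "cloud_vs_typical_datacenter": (1.4, 2.0),
    "ml_accelerator_vs_off_the_shelf": (2.0, 5.0),
}


def combined_range(factors):
    """Multiply the low ends and the high ends separately.

    Assumption: the factors are independent, so they compose by multiplication.
    """
    low = prod(lo for lo, _ in factors.values())
    high = prod(hi for _, hi in factors.values())
    return low, high


low, high = combined_range(FACTORS)
print(f"combined reduction: ~{low:.0f}X to ~{high:.0f}X")
```

Under these assumptions the product spans roughly 140X to 1000X, which is in line with the "up to ~100-1000X" stated in the abstract.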

Paper Structure

This paper contains 22 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Improvement in CO2e over Transformer (Big) on P100 GPUs in an average US datacenter versus Evolved Transformer (Medium) on TPU v2s in the Google Iowa datacenter.
  • Figure 2: Total FLOPS versus number of parameters relative to Transformer (Big) in a log-log graph (Table 1). Although the models do not all perform the same tasks, one reason T5 has lower FLOPS relative to its number of parameters is that it trains until the accuracy is good enough rather than to the best possible accuracy. [Kap20] notes that some architectures have a much lower footprint than others at equivalent accuracy and suggests that significant power might be saved by revisiting accuracy requirements.
  • Figure 3: Accelerator years of computation, energy consumption, and $\mathrm{CO}_{2}\mathrm{e}$ for five large NLP DNNs.
  • Figure 4: Reproduction of Figure 4 from So et al. Dots on the blue line represent various sizes of plain Transformer NLP models, while dots on the red line represent various sizes of the open-sourced Evolved Transformer architecture that was discovered by the neural architecture search run in [So19]. Red arrows are at $\mathbf{131M}$ and $\mathbf{210M}$ parameters and show that an Evolved Transformer can achieve higher accuracy at less cost: it runs $1.3\times$ faster and produces $1.3\times$ less $\mathrm{CO}_{2}\mathrm{e}$.
  • Figure 5: Measured vs peak performance, measured system power vs peak chip power (TDP), and measured vs peak performance/Watt for V100 GPU and TPU v3 (see Table 4 and Appendix A).
  • ...and 3 more figures