
Measuring the Energy Consumption and Efficiency of Deep Neural Networks: An Empirical Analysis and Design Recommendations

Charles Edison Tripp, Jordan Perr-Sauer, Jamil Gafur, Amabarish Nag, Avi Purkayastha, Sagi Zisman, Erik A. Bensen

TL;DR

This work introduces the BUTTER-E dataset, an augmentation to the BUTTER Empirical Deep Learning dataset, containing energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations, and proposes a straightforward and effective energy model that accounts for network size, computing, and memory hierarchy.

Abstract

Addressing the so-called "Red AI" trend of rising energy consumption by large-scale neural networks, this study investigates the actual energy consumption, as measured by node-level watt-meters, of training various fully connected neural network architectures. We introduce the BUTTER-E dataset, an augmentation to the BUTTER Empirical Deep Learning dataset, containing energy consumption and performance data from 63,527 individual experimental runs spanning 30,582 distinct configurations: 13 datasets, 20 sizes (number of trainable parameters), 8 network "shapes", and 14 depths, on both CPU and GPU hardware, collected using node-level watt-meters. This dataset reveals the complex relationship between dataset size, network structure, and energy use, and highlights the impact of cache effects. We propose a straightforward and effective energy model that accounts for network size, computing, and memory hierarchy. Our analysis also uncovers a surprising, hardware-mediated non-linear relationship between energy efficiency and network design, challenging the assumption that reducing the number of parameters or FLOPs is the best way to achieve greater energy efficiency. Highlighting the need for cache-considerate algorithm development, we suggest a combined approach to energy-efficient network, algorithm, and hardware design. This work contributes to the fields of sustainable computing and Green AI, offering practical guidance for creating more energy-efficient neural networks and promoting sustainable AI.

Paper Structure

This paper contains 9 sections, 6 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) The energy consumption (left axis) and corresponding carbon emissions given the average energy generation mix in the United States (right axis) incurred by training published AI models have increased dramatically over the last two decades, the so-called "Red AI Era." The AI System Total Energy is a computed metric, which accounts for efficiency gains in hardware (shown in (b)) and data center PUE. Note the logarithmic scale on the vertical axes for both (a) and (b). To compile this data we used CPU DB [danowitz2012cpu], the Intel, AMD, and NVIDIA manufacturer websites, and compiled tables on Wikipedia. We estimated J/FLOP using thermal design power (TDP) and published throughput (FLOP/s), or from instruction set, clock frequency, and core count where throughput was not published. We incorporated improvements in PUE [statista_2023] and used the United States national average CO$_2$ emissions per kWh in 2021 [eGRID_data_explorer_2024]. The number of parameters/FLOPs for each model is from [epochMachineLearningData2023].
  • Figure 2: Histograms (on logarithmic axes) showing the quantity and location of data filtered out for this analysis. The filters reject 241 runs, which is approximately 0.6% of the total number of runs.
  • Figure 3: (a) Illustrates the linear relationship between training set size and energy consumed per epoch. (b) Shows the marginal effect of the number of trainable parameters (NTP) on energy per training datum per epoch. NTP has a positive nonlinear relationship to energy used. (c) Shows the same for the effect of FLOPs. FLOPs have a positive nonlinear relationship to energy used that is very similar to NTP's relationship to energy. (d) Shows the same for the effect of depth. Depth and energy per datum are also positively related; we observe that deeper networks use more energy per batch.
  • Figure 4: (a) GPU and (b) CPU energy consumption as a function of working set size for four key working sets. The vertical lines spanning each subplot indicate the size of each physical cache level in the hardware. Energy consumption increases appear to coincide with certain working sets spilling into higher memory levels. We aggregate the energy consumption by using a windowed median for each working set, as a function of effective working set size, using a window size of $\pm 50\%$. For the Forward Pass only, the grey filled area spans the interquartile range and alpha-blended points show individual measurements.
  • Figure 5: Error distribution of the ratio of model-predicted to measured energy, $\hat{E_t} / E_t$. The estimates are unbiased regardless of the hardware type used, but the accuracy appears to be lower for experiments trained on CPU. This makes some intuitive sense, since CPUs have more complicated cache hierarchies and instruction sets than GPUs.
  • ...and 2 more figures
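
The Figure 1 caption describes estimating hardware efficiency in J/FLOP from thermal design power (TDP) and published throughput, then accounting for data center PUE and a grid emissions factor. The arithmetic can be sketched as follows, assuming that TDP approximates sustained power draw at peak throughput. The function names and the example numbers are hypothetical illustrations, not values or code from the paper.

```python
def joules_per_flop(tdp_watts: float, flops_per_second: float) -> float:
    """Estimate energy per floating-point operation as TDP / throughput.

    Assumption: the device draws its full TDP while sustaining the
    published throughput, so W / (FLOP/s) = J/FLOP.
    """
    if flops_per_second <= 0:
        raise ValueError("throughput must be positive")
    return tdp_watts / flops_per_second


def total_energy_kwh(total_flops: float, j_per_flop: float, pue: float = 1.0) -> float:
    """Scale device energy by PUE and convert joules to kilowatt-hours."""
    joules = total_flops * j_per_flop * pue
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


def emissions_kg(energy_kwh: float, kg_co2_per_kwh: float) -> float:
    """Convert energy to CO2 emissions with a grid-average emissions factor."""
    return energy_kwh * kg_co2_per_kwh


# Hypothetical accelerator: 250 W TDP, 1e13 FLOP/s published throughput.
efficiency = joules_per_flop(250.0, 1e13)
# Hypothetical training run of 1e18 FLOPs in a data center with PUE 1.5.
energy = total_energy_kwh(1e18, efficiency, pue=1.5)
# Hypothetical emissions factor of 0.4 kg CO2 per kWh.
co2 = emissions_kg(energy, 0.4)
```

Because this treats TDP as actual power draw, it is a rough estimate from published specifications; the node-level watt-meter measurements in BUTTER-E exist precisely to avoid relying on such estimates.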
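
The windowed median described in the Figure 4 caption can be sketched as below. This is a minimal illustration that assumes the $\pm 50\%$ window is relative to each evaluation point, so a point at size `c` aggregates measurements with working set sizes in `[0.5c, 1.5c]`. The caption does not spell out this detail, and the function and parameter names are hypothetical.

```python
import math
from statistics import median


def windowed_median(sizes, energies, centers, rel_window=0.5):
    """Median energy of measurements whose working set size lies within
    +/- rel_window (as a fraction) of each center.

    Returns one value per center; NaN where the window holds no data.
    """
    if len(sizes) != len(energies):
        raise ValueError("sizes and energies must have the same length")
    result = []
    for c in centers:
        lo, hi = c * (1.0 - rel_window), c * (1.0 + rel_window)
        in_window = [e for s, e in zip(sizes, energies) if lo <= s <= hi]
        result.append(median(in_window) if in_window else math.nan)
    return result


# Hypothetical measurements: working set sizes (bytes) and energies (joules).
sizes = [1, 2, 3, 4, 100]
energies = [10, 20, 30, 40, 1000]
smoothed = windowed_median(sizes, energies, centers=[2, 100, 1000])
```

A relative window keeps the smoothing width constant on a logarithmic size axis, which suits working set sizes that span several orders of magnitude.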