Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning

Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, Joelle Pineau

TL;DR

Addressing the climate impact of ML compute, the paper introduces the experiment-impact-tracker to standardize real-time energy and carbon reporting and demonstrates its use with an RL energy leaderboard. It shows that current reporting is sparse and that FLOPs are a poor global proxy for energy, motivating per-experiment accounting and region-aware carbon intensities. The framework supports automated logging, online appendices, and mitigation strategies such as green defaults, energy-focused leaderboards, and region-based compute placement. The work provides a practical path toward transparent energy and carbon reporting for ML, with concrete tools and community-facing recommendations to promote sustainable research practices.

Abstract

Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.

Paper Structure

This paper contains 36 sections, 1 equation, 12 figures, 1 table.

Figures (12)

  • Figure 1: A diagram demonstrating how the released version of the tool works. The main process launches a monitoring thread which iterates over a list of metrics associated with function calls to other tools. For example, if available, we call Intel RAPL to collect CPU power draw or query caiso.org to get realtime carbon intensity data for California. Once all the data that is compatible with the current system is gathered, it is logged to a standardized log file and the process repeats. The main thread may check in on this thread for exceptions, but the thread will not interrupt the main process. Once the main thread exits, an atexit hook (which is called whenever the main process exits, either successfully or through an exception) gathers the final information (such as the time the experiment ended), logs it, and then ends both the monitor and main process.
  • Figure 2: Realtime carbon intensity (g$\text{CO}_{2eq}$/kWh) collected during one experiment using our framework. As the experiment continued, the sun rose in California, and with it the carbon intensity decreased.
  • Figure 3: We run 50,000 rounds of inference on a single sampled image through pre-trained image classification models and record kWh, experiment time, FPOs (floating point operations), and number of parameters (repeating 4 times with different random seeds). References for models, code, and expanded experiment details can be found in Appendix \ref{app:imagenet}. We run an analysis similar to that of \citet{canziani2016analysis} and find (left) that FPOs are not strongly correlated with energy consumption ($R^2=0.083$, Pearson $0.289$) or with time ($R^2=0.005$, Pearson $-0.074$) when measured across different architectures. However, within an architecture (right), correlations are much stronger. Considering only different versions of VGG, FPOs are strongly correlated with energy ($R^2=0.999$, Pearson $1.0$) and time ($R^2=0.998$, Pearson $0.999$). Comparing parameters against energy yields similar results (see Appendix \ref{app:imagenet} for these results and plots against experiment runtime).
  • Figure 4: We compare carbon emissions (left) and kWh (right) of our Pong PPO experiment (see Appendix \ref{app:fig2} for more details) using different estimation methods. By only using country-wide or even regional average estimates, carbon emissions may be over- or under-estimated (respectively). Similarly, when partial information is used to estimate energy usage (right; for more information about the estimation methods, see Appendix \ref{app:fig2}), estimates differ significantly from those obtained by collecting all data in real time (as in our method). Clearly, without detailed accounting, it is easy to over- or under-estimate carbon emissions or energy use in a number of situations. Stars indicate level of significance: * p < .05, ** p < .01, *** p < .001, **** p < .0001. Annotation provided via: https://github.com/webermarcolivier/statannot.
  • Figure 5: We evaluate A2C, PPO, DQN, and A2C+Vtrace on PongNoFrameskip-v4 (left) and BreakoutNoFrameskip-v4 (right), two common evaluation environments included in OpenAI Gym. We train for only 5M timesteps, less than prior work, to encourage energy efficiency, and evaluate for 25 episodes every 250k timesteps. We show the Average Return across all evaluations throughout training (giving some measure of both ability and speed of convergence of an algorithm) as compared to the total energy in kWh. Weighted rankings of Average Return per kWh place A2C+Vtrace first on Pong and PPO first on Breakout. Using PPO versus DQN can yield significant energy savings while retaining performance on both environments (in the 5M samples regime). See Appendix \ref{app:rl} for more details and results in terms of asymptotic performance.
  • ...and 7 more figures
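
The monitoring loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the experiment-impact-tracker implementation: the names `Monitor`, `METRICS`, and `read_cpu_power_watts` are hypothetical, and the metric reader returns a placeholder value where a real tool would query something like Intel RAPL.

```python
import atexit
import json
import threading
import time


def read_cpu_power_watts():
    """Hypothetical metric reader; a real tracker might query Intel RAPL here."""
    return 42.0  # placeholder value, not a measurement


# Metric name -> function to call. Metrics the system cannot provide
# would simply be left out of this mapping (an assumption of this sketch).
METRICS = {"cpu_power_watts": read_cpu_power_watts}


class Monitor:
    """Logs metrics from a background thread without interrupting the main one."""

    def __init__(self, log_path, interval_s=1.0):
        self.log_path = log_path
        self.interval_s = interval_s
        self.error = None  # the main thread may inspect this for exceptions
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _log(self, record):
        # One JSON object per line keeps the log easy to parse later.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def _run(self):
        try:
            while not self._stop_event.is_set():
                record = {name: fn() for name, fn in METRICS.items()}
                record["time"] = time.time()
                self._log(record)
                self._stop_event.wait(self.interval_s)
        except Exception as exc:
            # Store the failure instead of raising into the main process.
            self.error = exc

    def start(self):
        self._log({"event": "start", "time": time.time()})
        # Runs when the interpreter exits, normally or after an uncaught exception.
        atexit.register(self.stop)
        self._thread.start()

    def stop(self):
        if self._stop_event.is_set():
            return  # already stopped; makes the atexit call harmless
        self._stop_event.set()
        if self._thread.is_alive():
            self._thread.join()
        self._log({"event": "end", "time": time.time()})
```

Under these assumptions, an experiment script would call `Monitor("impact_log.jsonl").start()` before its main work begins. The final `end` record is then written by the exit hook even if the script never calls `stop()` itself.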

Theorems & Definitions (7)

  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Example 5
  • Example 6
  • Example 7