Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks

Nick Kocher, Christian Wassermann, Leona Hennig, Jonas Seng, Holger Hoos, Kristian Kersting, Marius Lindauer, Matthias Müller

TL;DR

The paper addresses the need for energy-aware NAS benchmarks that balance model accuracy and energy use. It proposes three design principles (reliable power measurements, a wide range of GPU usage, and holistic cost reporting) and evaluates EA-HAS-Bench against them using multiple power measurement tools, identifying SMI sampling issues along the way. The main findings are that NVML-based measurements and Code Carbon provide reliable energy estimates, although Code Carbon requires calibration to avoid memory overestimation; calibration reduces the maximum inaccuracy from 10.3 % to 8.9 %, and to 6.6 % when the expected device load is estimated beforehand. The work offers practical guidelines to improve reproducibility and trust in energy-aware NAS benchmarks and highlights directions for hardware-agnostic and transferable benchmarking.

Abstract

Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking mitigates the cost of full training by querying a pre-trained surrogate to obtain an estimate for the quality of the model. Specifically, energy-aware benchmarking aims to make it possible for NAS to favourably trade off model energy consumption against accuracy. Towards this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimates. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we bring to attention several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. We observe a narrow power range for the four GPUs attached to our device, from 146 W to 305 W in a single-GPU setting, which narrows even further when all four GPUs are used. To improve holistic energy reporting, we propose calibration experiments over assumptions made in popular tools, such as Code Carbon, reducing the maximum inaccuracy from 10.3 % to 8.9 % without, and to 6.6 % with, prior estimation of the expected load on the device.
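
As an illustration of how sampled power readings relate to the energy figures and inaccuracy percentages discussed in the abstract, the following is a minimal sketch, not the authors' pipeline. It assumes power is sampled at known timestamps and integrates it with the trapezoidal rule, then compares a tool's estimate against a reference reading such as an external power meter. The function names `energy_joules` and `relative_error` and all numbers are hypothetical.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power (W) over time (s) with the trapezoidal rule -> energy (J)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power readings must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of two neighbouring readings times the elapsed time.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def relative_error(estimate, reference):
    """Absolute deviation of an estimate from a reference, as a fraction."""
    return abs(estimate - reference) / reference


# Illustrative values only (not measurements from the paper).
timestamps = [0.0, 1.0, 2.0, 3.0]          # seconds
tool_power = [100.0, 200.0, 200.0, 100.0]  # watts reported by a software tool
tool_energy = energy_joules(timestamps, tool_power)
meter_energy = 550.0                       # joules, hypothetical reference value
error = relative_error(tool_energy, meter_energy)
```

If samples are dropped or biased towards low-power states, the integral underestimates the true energy, which is why the sampling behaviour of the measurement API matters.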

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Switching of low- and high-power states measured by SMI queries during training of a neural architecture on one GPU. The red line is the power measured by the external power meter. The blue line is the power measured by SMI and the green line is the difference between the two.
  • Figure 2: eCDF of the power meter measurements and the SMI sampled power estimates for an example model training on one GPU.
  • Figure 3: GPU power vs the ratio of taken samples and expected samples. We see that there are no high power states with a low sample ratio.
  • Figure 4: Top: eCDF of the aggregated single-GPU training for SMI and power meter measurements. Bottom: eCDF of the aggregated multi-GPU training for SMI and power meter measurements.
  • Figure 5: Top left: GPU power consumption vs GPU utilisation for single-GPU training. Bottom left: GPU power consumption vs GPU memory utilisation for single-GPU training. Top right: GPU power consumption vs GPU utilisation for multi-GPU training. Bottom right: GPU power consumption vs GPU memory utilisation for multi-GPU training. All data is aggregated across epochs. Brighter colours indicate higher standard deviation.
  • ...and 3 more figures
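
The figure captions refer to two quantities that are easy to state in code: the empirical CDF (eCDF) of power readings and the ratio of taken to expected samples. The sketch below shows one common way to compute both, assuming a fixed nominal sampling interval; it is not taken from the paper, and the names `ecdf` and `sample_ratio` as well as the example values are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the share of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    return lambda x: bisect_right(ordered, x) / n


def sample_ratio(n_taken, duration_s, interval_s):
    """Samples actually collected divided by samples expected at the nominal interval."""
    expected = duration_s / interval_s
    return n_taken / expected


# Illustrative power readings in watts (not data from the paper).
readings_w = [146.0, 200.0, 250.0, 305.0]
F = ecdf(readings_w)
share_at_or_below_200 = F(200.0)

# Hypothetical run: 50 samples collected over 10 s at a nominal 0.1 s interval.
ratio = sample_ratio(n_taken=50, duration_s=10.0, interval_s=0.1)
```

A ratio well below 1 indicates that the sampler skipped readings, and comparing the eCDFs of two measurement sources shows whether one of them systematically reports lower power.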