Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach

Challapalli Phanindra Revanth, Sumohana S. Channappayya, C Krishna Mohan

TL;DR

GradSamp addresses the energy cost of training deep learning (DL) models by sampling gradient updates from layer-wise Gaussian distributions, which reduces the number of backpropagation operations and allows entire epochs to be skipped. The authors hypothesize that per-layer update errors follow Gaussian distributions, citing the central limit theorem (CLT) and the smooth, over-parameterized loss landscape, and they validate this hypothesis through experiments on CNNs and transformers. They extend the approach to stochastic Federated Learning (FL) and demonstrate energy savings of up to 20–50% with negligible performance loss across image classification, object detection, segmentation, Domain Adaptation (DA), and Domain Generalization (DG) tasks. The work offers a practical path toward green DL in both centralized and decentralized settings, with broad applicability and robustness across architectures and tasks.

Abstract

Computing the loss gradient via backpropagation consumes considerable energy during deep learning (DL) model training. In this paper, we propose a novel approach to efficiently compute DL models' gradients to mitigate the substantial energy overhead associated with backpropagation. Exploiting the over-parameterized nature of DL models and the smoothness of their loss landscapes, we propose a method called GradSamp for sampling gradient updates from a Gaussian distribution. Specifically, we update model parameters at a given epoch (chosen periodically or randomly) by perturbing the parameters (element-wise) from the previous epoch with Gaussian "noise". The parameters of the Gaussian distribution are estimated using the error between the model parameter values from the two previous epochs. GradSamp not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency. We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models, spanning various computer vision tasks such as image classification, object detection, and image segmentation. Additionally, we explore its efficacy in out-of-distribution scenarios such as Domain Adaptation (DA), Domain Generalization (DG), and decentralized settings like Federated Learning (FL). Our experimental results affirm the effectiveness of GradSamp in achieving notable energy savings without compromising performance, underscoring its versatility and potential impact in practical DL applications.
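
The update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `gradsamp_step` and `should_sample`, the choice of one (mean, standard deviation) pair per layer, and the default period of 10 epochs are assumptions made here for illustration. The sketch fits a Gaussian to the element-wise error between a layer's parameters at the two previous epochs, then perturbs the most recent parameters with noise drawn from that Gaussian, so no backpropagation is needed for the sampled epoch.

```python
import random
import statistics


def gradsamp_step(theta_prev2, theta_prev1, rng):
    """Sample next-epoch parameters for one layer without backpropagation.

    theta_prev2, theta_prev1: flat lists holding the layer's parameters
    from the two most recent epochs (older epoch first).
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Element-wise "error" between the two previous epochs.
    errors = [new - old for old, new in zip(theta_prev2, theta_prev1)]
    # Assumption: a single Gaussian per layer, fitted to these errors.
    mu = statistics.fmean(errors)
    sigma = statistics.pstdev(errors)
    # Perturb each parameter of the previous epoch with Gaussian noise.
    return [p + rng.gauss(mu, sigma) for p in theta_prev1]


def should_sample(epoch, period=10):
    """Hypothetical periodic schedule: sample every `period` epochs.

    Two earlier epochs are required to estimate the Gaussian, hence the
    lower bound on `epoch`.
    """
    return epoch >= 2 and epoch % period == 0


if __name__ == "__main__":
    rng = random.Random(0)
    layer_epoch_8 = [0.10, -0.20, 0.05, 0.30]
    layer_epoch_9 = [0.12, -0.18, 0.04, 0.33]
    if should_sample(10):
        layer_epoch_10 = gradsamp_step(layer_epoch_8, layer_epoch_9, rng)
        print(layer_epoch_10)
```

In a full training loop, epochs for which `should_sample` returns false would run ordinary backpropagation; the abstract also allows the sampled epochs to be chosen randomly rather than periodically.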

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures, 13 tables, 1 algorithm.

Figures (4)

  • Figure 1: ResNet-50 error histograms plotted at different epochs, with gradients sampled every $10$ epochs on the CIFAR-10 dataset.
  • Figure 2: Swin Transformer error histograms plotted at different epochs, with gradients sampled every $10$ epochs on the CIFAR-10 dataset.
  • Figure 3: MLP-Mixer error histograms plotted at different epochs, with gradients sampled every $10$ epochs on the CIFAR-10 dataset.
  • Figure 4: Examples where our hypothesis fails, that is, cases in which the normality test did not hold. Even in these cases, the histograms are mostly unimodal and could be modeled with a skewed Gaussian function.
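
The captions above refer to a normality test on the per-layer error histograms without naming it. As a hedged illustration only, the sketch below uses the Jarque-Bera test, which is an assumption here and may differ from the test used in the paper. It computes the sample skewness and excess kurtosis of a list of errors and converts the statistic to an asymptotic p-value; the reported skewness also indicates whether a skewed fit, as suggested for Figure 4, might be appropriate. The function name `jarque_bera` and the 0.05 threshold are illustrative choices.

```python
import math
import random


def jarque_bera(xs):
    """Return (statistic, asymptotic p-value, sample skewness) for xs."""
    n = len(xs)
    mean = sum(xs) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under normality, jb is asymptotically chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value, skew


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for per-layer update errors (not real data).
    gaussian_errors = [rng.gauss(0.0, 0.01) for _ in range(5000)]
    skewed_errors = [rng.expovariate(100.0) for _ in range(5000)]
    for name, errs in [("gaussian", gaussian_errors), ("skewed", skewed_errors)]:
        jb, p, skew = jarque_bera(errs)
        verdict = "not rejected" if p > 0.05 else "rejected"
        print(f"{name}: JB={jb:.2f}, p={p:.3g}, skew={skew:.3f}, normality {verdict}")
```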