ssProp: Energy-Efficient Training for Convolutional Neural Networks with Scheduled Sparse Back Propagation

Lujia Zhong, Shuo Huang, Yonggang Shi

TL;DR

The paper tackles the high energy and computational cost of training CNNs and large AI models by introducing ssProp, a scheduled sparse back-propagation scheme that applies channel-wise gradient selection, governed by drop schedulers, during the backward pass. The method is designed as a drop-in convolution module compatible with PyTorch, cutting backward FLOPs by nearly 40% while potentially improving generalization by mitigating overfitting. It is evaluated across diverse datasets and tasks, including ImageNet-1k classification and diffusion-based generation, and does not require hardware acceleration for sparsity. The result is a practical, scalable way to reduce energy consumption in AI development while preserving or improving model performance.

Abstract

Recently, deep learning has made remarkable strides, especially in generative modeling, such as large language models and probabilistic diffusion models. However, training these models often involves significant computational resources, requiring billions of petaFLOPs. This high resource consumption results in substantial energy usage and a large carbon footprint, raising critical environmental concerns. Back-propagation (BP) is a major source of computational expense when training deep learning models. To advance research on energy-efficient training and allow for sparse learning on any machine and device, we propose a general, energy-efficient convolution module that can be seamlessly integrated into any deep learning architecture. Specifically, we introduce channel-wise sparsity with additional gradient selection schedulers during the backward pass, based on the assumption that BP is often dense and inefficient, which can lead to over-fitting and high computational consumption. Our experiments demonstrate that our approach reduces computation by 40% while potentially improving model performance, validated on image classification and generation tasks. This reduction can lead to significant energy savings and a lower carbon footprint during the research and development phases of large-scale AI systems. Additionally, our method mitigates over-fitting in a manner distinct from Dropout, allowing it to be combined with Dropout to further enhance model performance and reduce computational resource usage. Extensive experiments validate that our method generalizes to a variety of datasets and tasks and is compatible with a wide range of deep learning architectures and modules. Code is publicly available at https://github.com/lujiazho/ssProp.
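
To make the mechanism concrete, the following is a minimal sketch of channel-wise gradient selection paired with a drop scheduler. It is written in NumPy rather than as the authors' PyTorch module. The function names (`bar_drop_rate`, `select_channels`), the ranking criterion (mean absolute gradient per output channel), and the alternating schedule with a peak drop rate of 0.8 are assumptions chosen for illustration; they are not taken from the released code.

```python
import numpy as np


def bar_drop_rate(epoch: int, max_rate: float = 0.8, period: int = 2) -> float:
    """Hypothetical bar-like scheduler: dense epochs alternate with sparse ones."""
    return max_rate if epoch % period == period - 1 else 0.0


def select_channels(grad_out: np.ndarray, drop_rate: float) -> np.ndarray:
    """Return sorted indices of the output channels kept in the backward pass.

    grad_out has shape (N, C_out, H, W). Channels are ranked by their mean
    absolute gradient (an assumed criterion) and the weakest ones are dropped.
    """
    c_out = grad_out.shape[1]
    n_drop = int(round(c_out * drop_rate))
    n_keep = max(1, c_out - n_drop)                  # always keep one channel
    score = np.abs(grad_out).mean(axis=(0, 2, 3))    # one score per channel
    keep = np.argsort(score)[::-1][:n_keep]          # strongest channels first
    return np.sort(keep)


# Toy gradient with four output channels of clearly different magnitude.
scales = np.array([0.1, 5.0, 0.01, 2.0]).reshape(1, 4, 1, 1)
grad_out = np.ones((2, 4, 3, 3)) * scales

keep = select_channels(grad_out, drop_rate=0.5)
sparse_grad = grad_out[:, keep]                      # only these channels flow back

rates = [bar_drop_rate(e) for e in range(4)]
mean_rate = sum(rates) / len(rates)
```

Under this assumed schedule, half of the epochs run dense and the other half drop 80% of the output channels, so the average drop rate is 0.4. This is one way a saving close to 40% could arise, offered here only as an illustration.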

Paper Structure (19 sections, 7 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Overview of our method in CNNs. (a) Workflow in training one convolution layer. (b) Flowchart of the convolution using img2col and col2img, demonstrated with an input shape of (1, 2, 3, 3) and a kernel size of (2, 2, 2, 2).
  • Figure 2: Sensitivity analysis. (a) Sparsified dimensions. (b) Gradient selection. (c) & (d) Sparsification schedulers.
  • Figure 3: Sparsely trained DDPM-generated samples on MNIST, FashionMNIST, and CelebA.
  • Figure 4: Test accuracy patterns of sparsely and normally trained CNN models on CIFAR100.
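
The shapes quoted in the Figure 1 caption can be traced with a small NumPy sketch of convolution via img2col. It assumes stride 1 and no padding, which the caption does not state, and the helpers `img2col` and `conv_weight_grad` are hypothetical names rather than functions from the ssProp repository. The sketch shows how an input of shape (1, 2, 3, 3) unfolds into columns, how the forward pass becomes a matrix product, and how restricting the backward pass to a subset of output channels shrinks the weight-gradient product.

```python
import numpy as np


def img2col(x: np.ndarray, kh: int, kw: int) -> np.ndarray:
    """Unfold x of shape (N, C, H, W) into columns of shape (N, C*kh*kw, L).

    Assumes stride 1 and no padding, so L = (H - kh + 1) * (W - kw + 1).
    """
    n, c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((n, c * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, :, i:i + kh, j:j + kw]          # (N, C, kh, kw)
            cols[:, :, i * ow + j] = patch.reshape(n, -1)
    return cols


def conv_weight_grad(cols: np.ndarray, grad_out: np.ndarray, keep) -> np.ndarray:
    """Weight gradient computed for the kept output channels only.

    Rows of dropped channels stay zero, so the matrix product involves
    len(keep) rows instead of C_out rows.
    """
    n, c_out, oh, ow = grad_out.shape
    g = grad_out.reshape(n, c_out, oh * ow)[:, keep]     # (N, k, L)
    grad_w = np.zeros((c_out, cols.shape[1]), dtype=cols.dtype)
    grad_w[keep] = (g @ cols.transpose(0, 2, 1)).sum(axis=0)
    return grad_w


x = np.arange(18, dtype=np.float64).reshape(1, 2, 3, 3)  # (N, C_in, H, W)
weight = np.ones((2, 2, 2, 2))                           # (C_out, C_in, kh, kw)
weight[1] *= 2.0

cols = img2col(x, 2, 2)                                  # (1, 8, 4)
out = (weight.reshape(2, -1) @ cols).reshape(1, 2, 2, 2) # forward as a matmul

grad_out = np.ones_like(out)
grad_w = conv_weight_grad(cols, grad_out, keep=[1])      # channel 0 is dropped
```

Because the kept channels are selected before the matrix product, the saving comes from a smaller dense product rather than from sparse kernels, which is consistent with the claim that no hardware support for sparsity is required.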