Less Memory Means smaller GPUs: Backpropagation with Compressed Activations

Daniel Barley; Holger Fröning

TL;DR

This work considers compressing activation maps for the backward pass using pooling, which can reduce both the memory footprint and the amount of data movement. It empirically shows convergence and studies the effects on feature detection, using the common vision architecture ResNet as an example.

Abstract

The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements. Many recent architectures, most prominently Large Language Models, have to be trained on supercomputers with thousands of accelerators, such as GPUs or TPUs. Besides the vast number of floating-point operations, the memory footprint of DNNs is also exploding. In contrast, GPU architectures are notoriously short on memory. Even comparatively small architectures like some EfficientNet variants cannot be trained on a single consumer-grade GPU at reasonable mini-batch sizes. During training, intermediate input activations have to be stored until backpropagation, where they are needed for gradient calculation. These activations make up the vast majority of the memory footprint. In this work we therefore consider compressing activation maps for the backward pass using pooling, which can reduce both the memory footprint and the amount of data movement. The forward computation remains uncompressed. We empirically show convergence and study the effects on feature detection, using the common vision architecture ResNet as an example. With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule, while maintaining prediction accuracy compared to an uncompressed baseline.
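To make the idea of pooling-based activation compression concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says "pooling", so the choice of average pooling, the non-overlapping 2 x 2 window, the inflation by repetition, and the helper names `compress` and `inflate` are all hypothetical.

```python
def compress(activation, k):
    """Average-pool a 2D map (list of rows) with a non-overlapping k x k window.

    Assumption: average pooling; the source only states that pooling is used.
    """
    h, w = len(activation), len(activation[0])
    assert h % k == 0 and w % k == 0, "sketch assumes dimensions divisible by k"
    return [
        [
            sum(activation[i + di][j + dj] for di in range(k) for dj in range(k))
            / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]


def inflate(compressed, k):
    """Repeat each pooled value over its k x k window to restore the original shape."""
    h, w = len(compressed) * k, len(compressed[0]) * k
    return [[compressed[i // k][j // k] for j in range(w)] for i in range(h)]


# A toy 4 x 4 activation map (values chosen for illustration only).
x = [
    [1.0, 3.0, 0.0, 2.0],
    [5.0, 7.0, 4.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
    [8.0, 0.0, 1.0, 1.0],
]

z = compress(x, 2)      # this is what would be kept for the backward pass
x_hat = inflate(z, 2)   # this is what the backward pass would see

# Fraction of elements that has to be stored compared to the original map.
stored_fraction = (len(z) * len(z[0])) / (len(x) * len(x[0]))
print(z, stored_fraction)
```

With a k x k window, the stored map holds 1/k² of the original elements for that layer. The reduction in peak memory for a whole model is smaller than this per-layer ratio, since parameters, optimizer state, and uncompressed tensors also contribute.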

Paper Structure (12 sections, 1 equation, 6 figures, 1 table)

Introduction
Related Work
Model Characterization
Compressing Activation Maps
Experiments
Training dynamics
Layer sensitivity
Memory footprint reduction
Maintaining accuracy
Discussion and Future Work
Acknowledgments
Disclosure of Interests

Figures (6)

Figure 1: Memory footprint during training of common vision architectures, split into model parameters and activations. The ratio is heavily skewed toward activations. The most balanced is vit_l_32 at 68.8% activations. On average, activations are responsible for 91.8% of the memory consumed. EfficientNet_[B1-B4] are the most extreme examples at 98.7% activations. The mini-batch size is 32 for all models. The red lines show the memory capacity of consumer-grade GPUs of the NVIDIA 4000 series for comparison. Note that the memory capacity is given in gigabytes (G, decimal), while our measurements are in gibibytes (Gi, binary). The printed percentages represent the proportion of activations relative to the peak memory usage.
Figure 2: Fine-grained tracing of allocations during model initialization and processing of a single mini-batch of data for the ResNet152 architecture. The red line shows the step-wise allocation of forward activations and their step-wise release as the backward computation unfolds. After the backward computation is completed, a second peak can be observed during the optimization stage. We can simplify this representation by only measuring at points of interest, represented by the blue marks: model/input initialization (1./2.), forward peak (3.), after backward (4.), and optimizer peak (5.).
Figure 3: Example of activation map compression for a convolutional layer. During the forward computation the input activation $X$ is convolved with the weight tensor $W$ and a bias $b$ is added to produce output $Y$. Before saving the activation map for the backward pass, we compress it using pooling. Except for the original dimensions, this requires no additional encoding overhead. During backpropagation we inflate the compressed activation $Z$ back to the original dimensions and use it to obtain the weight gradient, as shown in red. The other gradients, shown in green, are not affected by this, as they do not depend on the activations. This ensures that the propagated activation gradient remains accurate throughout the network. Only the weight updates themselves are imprecise.
Figure 4: Overview of the ResNet architecture (c) and its residual building blocks. The downsample block (a) features a $(1 \times 1)$ convolution in the residual path, whereas the basic block (b) adds the input unaltered.
Figure 5: Training loss for ResNet18 trained on ImageNet using SGD with momentum and step learning rate scheduler. Compression seems to introduce an offset to the loss curve.
...and 1 more figure
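The claim in the Figure 3 caption, that only the weight gradient is affected by compression while the propagated activation gradient stays exact, can be checked on a toy case. The sketch below is a hedged illustration rather than the paper's code: it assumes a single-channel 1D convolution without padding, average pooling with window 2, and inflation by repetition. All function names and the example values are hypothetical.

```python
def conv1d_forward(x, w, b):
    """y[i] = sum_j w[j] * x[i + j] + b (single channel, no padding)."""
    n_out = len(x) - len(w) + 1
    return [sum(w[j] * x[i + j] for j in range(len(w))) + b for i in range(n_out)]


def conv1d_backward(x_saved, w, grad_y):
    """Return (grad_x, grad_w, grad_b) given the saved input and upstream gradient."""
    # The weight gradient is the only term that reads the saved activation.
    grad_w = [
        sum(grad_y[i] * x_saved[i + j] for i in range(len(grad_y)))
        for j in range(len(w))
    ]
    # The bias gradient depends only on the upstream gradient.
    grad_b = sum(grad_y)
    # The input gradient depends only on the upstream gradient and the weights.
    grad_x = [0.0] * (len(grad_y) + len(w) - 1)
    for i, g in enumerate(grad_y):
        for j, wj in enumerate(w):
            grad_x[i + j] += g * wj
    return grad_x, grad_w, grad_b


def pool1d(x, k):
    """Average pooling with a non-overlapping window of size k (assumed variant)."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def inflate1d(z, k):
    """Repeat each pooled value k times to restore the original length."""
    return [v for v in z for _ in range(k)]


x = [1.0, 3.0, 2.0, 6.0]   # input activation X
w = [0.5, -1.0]            # weights W
b = 0.1                    # bias
y = conv1d_forward(x, w, b)
grad_y = [1.0, 1.0, 1.0]   # stand-in for the gradient arriving from the next layer

exact = conv1d_backward(x, w, grad_y)
z = pool1d(x, 2)                                   # compressed activation Z
approx = conv1d_backward(inflate1d(z, 2), w, grad_y)

print("grad_x equal:", exact[0] == approx[0])
print("grad_b equal:", exact[2] == approx[2])
print("grad_w exact vs. compressed:", exact[1], approx[1])
```

In this toy case the input and bias gradients are identical in both runs, while the weight gradient differs, which matches the caption's statement that only the weight updates become imprecise.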
