GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

Ding-Yong Hong; Tzu-Hsien Tsai; Ning Wang; Pangfeng Liu; Jan-Jan Wu

TL;DR

Deep network training incurs substantial GPU memory pressure from intermediate activations. The authors cast checkpoint selection as a peak-memory minimization problem, first solving it with an $O(n^3)$ dynamic-programming algorithm and then with a refined $O(n)$-time dynamic-programming approach that mirrors PyTorch memory behavior. They demonstrate that aligning the memory model with the framework yields significantly tighter peak memory bounds at comparable training times, outperforming prior $O(\sqrt{n})$ and ACG-based methods in practice. The work enables training larger models within fixed GPU memory and provides a principled, efficient method for memory-aware training in modern DL platforms.

Abstract

In modern deep learning, the trend is to design larger Deep Neural Networks (DNNs) to handle more complex tasks and achieve better accuracy. Meanwhile, Convolutional Neural Networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocated for intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve this problem. Besides hardware-dependent solutions, a general methodology, rematerialization, can reduce GPU memory usage by trading computation for memory. The idea is to select a set of intermediate results during the forward phase as checkpoints and save only those in memory. The backward phase recomputes the remaining intermediate data from the closest checkpoints as needed. This recomputation increases execution time but saves memory by not storing all intermediate results during the forward phase. In this paper, we focus on efficiently finding the optimal checkpoint subset that achieves the least peak memory usage during model training. We first describe the theoretical background of neural network training using mathematical equations. We use these equations to identify all essential data required during both the forward and backward phases to compute the gradients of the model weights. We then formulate the checkpoint selection problem and propose a dynamic programming algorithm with time complexity $O(n^3)$ for finding the optimal checkpoint subset. Through extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis, revise the objective function based on the tracing, and propose an $O(n)$-time algorithm for finding the optimal checkpoint subset.
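To make the checkpoint selection problem concrete, the following minimal Python sketch evaluates peak memory for a linear chain of layers and finds the best checkpoint subset by exhaustive search. It is not the paper's $O(n^3)$ or $O(n)$ algorithm, and the memory model is a simplifying assumption: every checkpoint stays resident for the whole backward phase, and only one segment between consecutive checkpoints is recomputed at a time. The function names `peak_memory` and `best_checkpoints` are hypothetical.

```python
from itertools import combinations


def peak_memory(sizes, checkpoints):
    """Peak memory under a simplified model (an assumption, not the paper's):
    all checkpoints stay in memory, plus the largest run of non-checkpoint
    activations that must be recomputed between consecutive checkpoints."""
    kept = sum(sizes[i] for i in checkpoints)
    largest_segment = 0
    current = 0
    for i, size in enumerate(sizes):
        if i in checkpoints:
            # A checkpoint closes the current recomputation segment.
            largest_segment = max(largest_segment, current)
            current = 0
        else:
            current += size
    largest_segment = max(largest_segment, current)
    return kept + largest_segment


def best_checkpoints(sizes):
    """Exhaustive search over all subsets; exponential, for tiny chains only."""
    n = len(sizes)
    best = (peak_memory(sizes, frozenset()), frozenset())
    for k in range(1, n + 1):
        for combo in combinations(range(n), k):
            candidate = frozenset(combo)
            peak = peak_memory(sizes, candidate)
            if peak < best[0]:
                best = (peak, candidate)
    return best


if __name__ == "__main__":
    sizes = [4] * 9  # nine layers, each producing 4 units of activation data
    print(peak_memory(sizes, frozenset()))        # store nothing, recompute all
    print(peak_memory(sizes, frozenset({2, 5})))  # two evenly spaced checkpoints
    print(best_checkpoints(sizes))
```

Under this toy model, storing more checkpoints raises the resident cost while shrinking the largest recomputation segment, which is the trade-off an optimal checkpoint subset has to balance. The exhaustive search only serves to show what "optimal" means on a small chain; its cost grows exponentially with the number of layers.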
