Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation

Yuchen Yang, Yingdong Shi, Cheems Wang, Xiantong Zhen, Yuxuan Shi, Jun Xu

TL;DR

This work tackles the high activation-memory cost of fine-tuning large pretrained models. It develops Approximate Backpropagation (Approx-BP) to safely decouple the forward and backward passes, enabling memory-efficient derivatives for non-linear modules; it also introduces Memory-Sharing Backpropagation (MS-BP) to remove activation-memory redundancy by merging LayerNorm/RMSNorm with adjacent linear layers. The paper derives ReGELU2 and ReSiLU2 as memory-efficient backward derivatives and proposes MS-LN/MS-RMSNorm to halve activation memory in normalization blocks, all without extra computation. Empirically, the approach reduces peak memory usage by up to about $30\%$ across ViT, LLaMA, and RoBERTa fine-tuning, with no loss in training throughput and comparable accuracy. This has practical impact for PEFT and full fine-tuning on memory-constrained hardware, and may extend to pretraining and longer sequence lengths in large transformers.
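
As a rough illustration of where the activation-function savings come from, consider the memory needed to back-propagate through a single GELU in a ViT-B-style MLP. The shapes below (batch 64, 197 tokens, MLP width 3072) are illustrative assumptions, not figures reported in the paper:

```python
# Back-of-the-envelope activation-memory estimate for one GELU in a ViT-B-like MLP.
# Shapes are illustrative assumptions (batch 64, 197 tokens, MLP width 3072).
batch, tokens, mlp_width = 64, 197, 3072
elements = batch * tokens * mlp_width

fp16_cache = elements * 2           # standard GELU caches its fp16 input (2 bytes/element)
two_bit_cache = elements * 2 // 8   # a 4-segment step derivative needs only 2 bits/element

print(f"fp16 input cache : {fp16_cache / 2**20:.1f} MiB")    # ~74 MiB
print(f"2-bit cache      : {two_bit_cache / 2**20:.1f} MiB")  # ~9 MiB
```

That is roughly an 8x reduction for this single activation tensor, repeated in every transformer block.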

Abstract

Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from the perspectives of activation functions and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to $\sim 30\%$ of the peak memory usage. Our code is released at https://github.com/yyyyychen/LowMemoryBP.
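
To make the "unchanged forward, approximate backward" idea concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' released implementation): the forward pass computes the exact GELU, while the backward pass uses a ReLU-style step derivative, so only a mask, rather than the full input tensor, needs to be cached. ReGELU2 itself uses a 4-segment (2-bit) step function; the class name and the single-step derivative here are simplifications.

```python
import torch
import torch.nn.functional as F

class ApproxGELU(torch.autograd.Function):
    """Sketch of the Approx-BP idea: exact GELU forward, step-function backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x > 0)   # cache only a sign mask, not x itself
        return F.gelu(x)               # forward pass is the standard GELU

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Approximate d/dx GELU(x) by the ReLU derivative: 1 for x > 0, else 0.
        return grad_output * mask.to(grad_output.dtype)


x = torch.randn(4, 16, requires_grad=True)
y = ApproxGELU.apply(x)
y.sum().backward()   # gradients flow through the approximate derivative
```

Note that PyTorch stores a boolean tensor with one byte per element, so realizing the 2-bit footprint in practice additionally requires bit-packed storage in a custom kernel.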

Paper Structure

This paper contains 38 sections, 4 theorems, 69 equations, 8 figures, 12 tables, and 3 algorithms.

Key Result

Theorem 4.1

Under the definitions in the Approx-BP theory subsection, assume that: A1. $\widetilde{\bm{g}}(\ell(\bm{z}^L), \bm{z}, \bm{\theta})$ is uniformly Lipschitz continuous w.r.t. $\ell(\bm{z}^L)$ and $\bm{z}$. A2. $\ell(\bm{z}^L)$ is Lipschitz continuous. $\bm{h}^i(\bm{z}^{i-1},\bm{\theta}^i)$ is uniformly Lipschitz continuous […]

Figures (8)

  • Figure 1: Throughput (images/s) and memory usage (MiB) with LoRA (Hu et al., 2022) (rank $=4$, batch size $=64$) when fine-tuning a pretrained ViT-B (Dosovitskiy et al., 2020) on CIFAR10/100 (Krizhevsky, 2009) and FGVC (Jia et al., 2022). "LoRA + CKPT": LoRA with gradient checkpointing (Chen et al., 2016) on every block. "LoRA + Mesa": LoRA with 8-bit activation quantization on GELU and LayerNorm (Pan et al., 2021). "LoRA + Ours": LoRA with our ReGELU2 and MS-LN. More details are provided in the Experiments section.
  • Figure 2: Composition of activation memory usage in ViT and LLaMA (LLaMA-13B as the example). Our method can reduce the activation memory usage of the GELU/SiLU and LayerNorm/RMSNorm components (the split parts); a conceptual sketch of the LayerNorm-merging idea is given after this figure list.
  • Figure 3: Plot of our ReGELU2. The primitive (forward) function is still GELU, while the derivative function is a 4-segment step function that needs only 2 bits of activation memory per element for the derivative calculation.
  • Figure 4: Convergence of ReGELU2 and MS-LN when using LoRA (rank $=4$) on ViT-B (Dosovitskiy et al., 2020). The training loss is averaged over CIFAR10/100 (Krizhevsky, 2009) and FGVC (Jia et al., 2022).
  • Figure 5: Composition of the activation memory in each block of ViT (Dosovitskiy et al., 2020). We assume that Layer Normalization uses fp32, the other operators use fp16, and each operator in the table is implemented as a single CUDA kernel. The unit of memory is the size of a 16-bit tensor of shape $[b,n,c]$.
  • ...and 3 more figures
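
As referenced above, one ingredient of the memory-sharing idea is that a LayerNorm's elementwise affine parameters can be folded into the linear layer that follows it, so that the normalization output coincides with the tensor the linear layer caches for its own backward pass. The sketch below shows only this folding step, under the assumption of a LayerNorm-then-Linear layout with a bias; merge_ln_into_linear is a hypothetical helper name, and the full MS-LN additionally computes the LayerNorm backward from its output so the input never needs to be stored.

```python
import torch

@torch.no_grad()
def merge_ln_into_linear(ln: torch.nn.LayerNorm, fc: torch.nn.Linear):
    """Fold LayerNorm's affine (gamma, beta) into the following Linear layer.

    With y = x_hat * gamma + beta and fc(y) = y @ W.T + b, we have
        fc(y) = x_hat @ (W * gamma).T + (W @ beta + b),
    so after folding the normalization layer can output the plain x_hat, which is
    exactly the tensor the Linear layer caches for its own backward pass.
    Assumes ln has elementwise affine parameters and fc has a bias.
    """
    fc.bias.add_(fc.weight @ ln.bias)   # uses the original W, before scaling
    fc.weight.mul_(ln.weight)           # scale each input column by gamma
    ln.weight.fill_(1.0)
    ln.bias.zero_()
    return ln, fc


# Quick check that the merged pair matches the original composition.
ln, fc = torch.nn.LayerNorm(8), torch.nn.Linear(8, 4)
torch.nn.init.normal_(ln.weight)  # give the affine parameters non-trivial values
torch.nn.init.normal_(ln.bias)    # so the equivalence check is meaningful
x = torch.randn(2, 8)
before = fc(ln(x))
merge_ln_into_linear(ln, fc)
after = fc(ln(x))
print(torch.allclose(before, after, atol=1e-6))
```

After this folding, the normalization layer and the following linear layer can share a single saved activation, which is the source of the roughly halved normalization-block memory mentioned in the TL;DR.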

Theorems & Definitions (6)

  • Theorem 4.1
  • Theorem 4.2
  • Proposition 4.3
  • Proposition 5.1
  • Proof of Theorem 4.1
  • Proof of Theorem 4.2