Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation

Yuchen Yang, Yingdong Shi, Cheems Wang, Xiantong Zhen, Yuxuan Shi, Jun Xu

TL;DR

This work tackles the high activation-memory cost of fine-tuning large pretrained models. It develops Approximate Backpropagation (Approx-BP) to safely decouple the forward and backward passes, enabling memory-efficient derivatives for non-linear modules; it also introduces Memory-Sharing Backpropagation (MS-BP) to remove activation-memory redundancy by merging LayerNorm/RMSNorm with adjacent linear layers. The paper derives ReGELU2 and ReSiLU2 as memory-efficient backward derivatives and proposes MS-LN/MS-RMSNorm to halve activation memory in normalization blocks, all without extra computation. Empirically, the approach reduces peak memory usage by up to about $30\%$ across ViT, LLaMA, and RoBERTa fine-tuning, with no loss in training throughput and comparable accuracy. This has practical impact for PEFT and full fine-tuning on memory-constrained hardware, and may extend to pretraining and longer sequence lengths in large transformers.
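
As a rough illustration of where the activation-function savings come from, consider the memory needed to back-propagate through a single GELU in a ViT-B-style MLP. The shapes below (batch 64, 197 tokens, MLP width 3072) are illustrative assumptions, not figures reported in the paper:

```python
# Back-of-the-envelope activation-memory estimate for one GELU in a ViT-B-like MLP.
# Shapes are illustrative assumptions (batch 64, 197 tokens, MLP width 3072).
batch, tokens, mlp_width = 64, 197, 3072
elements = batch * tokens * mlp_width

fp16_cache = elements * 2           # standard GELU caches its fp16 input (2 bytes/element)
two_bit_cache = elements * 2 // 8   # a 4-segment step derivative needs only 2 bits/element

print(f"fp16 input cache : {fp16_cache / 2**20:.1f} MiB")    # ~74 MiB
print(f"2-bit cache      : {two_bit_cache / 2**20:.1f} MiB")  # ~9 MiB
```

That is roughly an 8x reduction for this single activation tensor, repeated in every transformer block.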

Abstract

Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from the perspectives of activation functions and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to $\sim 30\%$ of the peak memory usage. Our code is released at https://github.com/yyyyychen/LowMemoryBP.
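
To make the "unchanged forward, approximate backward" idea concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' released implementation): the forward pass computes the exact GELU, while the backward pass uses a ReLU-style step derivative, so only a mask, rather than the full input tensor, needs to be cached. ReGELU2 itself uses a 4-segment (2-bit) step function; the class name and the single-step derivative here are simplifications.

```python
import torch
import torch.nn.functional as F

class ApproxGELU(torch.autograd.Function):
    """Sketch of the Approx-BP idea: exact GELU forward, step-function backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x > 0)   # cache only a sign mask, not x itself
        return F.gelu(x)               # forward pass is the standard GELU

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Approximate d/dx GELU(x) by the ReLU derivative: 1 for x > 0, else 0.
        return grad_output * mask.to(grad_output.dtype)


x = torch.randn(4, 16, requires_grad=True)
y = ApproxGELU.apply(x)
y.sum().backward()   # gradients flow through the approximate derivative
```

Note that PyTorch stores a boolean tensor with one byte per element, so realizing the 2-bit footprint in practice additionally requires bit-packed storage in a custom kernel.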

Paper Structure

This paper contains 38 sections, 4 theorems, 69 equations, 8 figures, 12 tables, and 3 algorithms.

Key Result

Theorem 4.1

Under the definitions in the Approx-BP theory subsection, assume that: A1. $\widetilde{\bm{g}}(\ell(\bm{z}^L), \bm{z}, \bm{\theta})$ is uniformly Lipschitz continuous w.r.t. $\ell(\bm{z}^L)$ and $\bm{z}$. A2. $\ell(\bm{z}^L)$ is Lipschitz continuous. $\bm{h}^i(\bm{z}^{i-1},\bm{\theta}^i)$ is uniformly Lipschitz continuous […]

Figures (8)

  • Figure 1: Throughput (images/s) and memory usage (MiB) with LoRA (Hu et al., 2022) (rank $=4$, batch size $=64$) when fine-tuning a pretrained ViT-B (Dosovitskiy et al., 2020) on CIFAR10/100 (Krizhevsky, 2009) and FGVC (Jia et al., 2022). "LoRA + CKPT": LoRA with gradient checkpointing (Chen et al., 2016) on every block. "LoRA + Mesa": LoRA with 8-bit activation quantization on GELU and LayerNorm (Pan et al., 2021). "LoRA + Ours": LoRA with our ReGELU2 and MS-LN. More details are provided in the Experiments section.
  • Figure 2: Composition of activation memory usage in ViT and LLaMA (LLaMA-13B as the example). Our method can reduce the activation memory usage of the GELU/SiLU and LayerNorm/RMSNorm components (the split parts); a conceptual sketch of the LayerNorm-merging idea is given after this figure list.
  • Figure 3: Plot of our ReGELU2. The primitive (forward) function is still GELU, while the derivative function is a 4-segment step function that needs only 2 bits of activation memory per element for the derivative calculation.
  • Figure 4: Convergence of ReGELU2 and MS-LN when using LoRA (rank $=4$) on ViT-B (Dosovitskiy et al., 2020). The training loss is averaged over CIFAR10/100 (Krizhevsky, 2009) and FGVC (Jia et al., 2022).
  • Figure 5: Composition of the activation memory in each block of ViT (Dosovitskiy et al., 2020). We assume that Layer Normalization uses fp32, the other operators use fp16, and each operator in the table is implemented as a single CUDA kernel. The unit of memory is the size of a 16-bit tensor of shape $[b,n,c]$.
  • ...and 3 more figures
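
As referenced above, one ingredient of the memory-sharing idea is that a LayerNorm's elementwise affine parameters can be folded into the linear layer that follows it, so that the normalization output coincides with the tensor the linear layer caches for its own backward pass. The sketch below shows only this folding step, under the assumption of a LayerNorm-then-Linear layout with a bias; merge_ln_into_linear is a hypothetical helper name, and the full MS-LN additionally computes the LayerNorm backward from its output so the input never needs to be stored.

```python
import torch

@torch.no_grad()
def merge_ln_into_linear(ln: torch.nn.LayerNorm, fc: torch.nn.Linear):
    """Fold LayerNorm's affine (gamma, beta) into the following Linear layer.

    With y = x_hat * gamma + beta and fc(y) = y @ W.T + b, we have
        fc(y) = x_hat @ (W * gamma).T + (W @ beta + b),
    so after folding the normalization layer can output the plain x_hat, which is
    exactly the tensor the Linear layer caches for its own backward pass.
    Assumes ln has elementwise affine parameters and fc has a bias.
    """
    fc.bias.add_(fc.weight @ ln.bias)   # uses the original W, before scaling
    fc.weight.mul_(ln.weight)           # scale each input column by gamma
    ln.weight.fill_(1.0)
    ln.bias.zero_()
    return ln, fc


# Quick check that the merged pair matches the original composition.
ln, fc = torch.nn.LayerNorm(8), torch.nn.Linear(8, 4)
torch.nn.init.normal_(ln.weight)  # give the affine parameters non-trivial values
torch.nn.init.normal_(ln.bias)    # so the equivalence check is meaningful
x = torch.randn(2, 8)
before = fc(ln(x))
merge_ln_into_linear(ln, fc)
after = fc(ln(x))
print(torch.allclose(before, after, atol=1e-6))
```

After this folding, the normalization layer and the following linear layer can share a single saved activation, which is the source of the roughly halved normalization-block memory mentioned in the TL;DR.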

Theorems & Definitions (6)

  • Theorem 4.1
  • Theorem 4.2
  • Proposition 4.3
  • Proposition 5.1
  • Proof of Theorem 4.1
  • Proof of Theorem 4.2