Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

Juneyoung Park; Yuri Hong; Seongwan Kim; Jaeho Lee

TL;DR

Memory-efficient Structured Backpropagation (MeSP) bridges the gap between exact but memory-hungry backpropagation (MeBP) and low-memory but noisy zeroth-order estimation (MeZO). It does so by manually deriving backward passes that exploit LoRA's low-rank structure, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.

Abstract

On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6-12 GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B-3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx 0.001$), explaining its slow convergence. MeSP reduces peak memory from 361 MB to 136 MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
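To make the recomputation idea concrete, here is a minimal NumPy sketch of a hand-derived LoRA backward pass that rebuilds $h = xA$ instead of loading a cached copy. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `scale` factor, the toy shapes, and the random (rather than zero) initialization of $B$ are all hypothetical, and the layer is assumed to compute $y = xW + \text{scale} \cdot (xA)B$ with row-vector inputs.

```python
import numpy as np

def lora_forward(x, W, A, B, scale):
    """Assumed layer: y = x W + scale * (x A) B. h = x A is not returned or cached."""
    return x @ W + scale * ((x @ A) @ B)

def lora_backward_recompute(x, W, A, B, scale, grad_y):
    """Hand-derived backward pass that recomputes h rather than storing it."""
    h = x @ A                              # (n, r): small because r << d_in
    grad_B = scale * (h.T @ grad_y)        # (r, d_out)
    grad_h = scale * (grad_y @ B.T)        # (n, r)
    grad_A = x.T @ grad_h                  # (d_in, r)
    grad_x = grad_y @ W.T + grad_h @ A.T   # (n, d_in), passed to the previous layer
    return grad_x, grad_A, grad_B

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
n, d_in, d_out, r, scale = 4, 32, 16, 2, 0.5
x = rng.standard_normal((n, d_in))
W = rng.standard_normal((d_in, d_out))
A = rng.standard_normal((d_in, r))
B = rng.standard_normal((r, d_out))
G = rng.standard_normal((n, d_out))        # upstream gradient for the toy loss sum(y * G)

grad_x, grad_A, grad_B = lora_backward_recompute(x, W, A, B, scale, G)

def loss(A_, B_):
    return float(np.sum(lora_forward(x, W, A_, B_, scale) * G))

# Central finite differences on one entry of A and one entry of B.
eps = 1e-4
A_plus, A_minus = A.copy(), A.copy()
A_plus[3, 1] += eps
A_minus[3, 1] -= eps
fd_A = (loss(A_plus, B) - loss(A_minus, B)) / (2 * eps)

B_plus, B_minus = B.copy(), B.copy()
B_plus[1, 5] += eps
B_minus[1, 5] -= eps
fd_B = (loss(A, B_plus) - loss(A, B_minus)) / (2 * eps)
```

In this sketch the only extra work in the backward pass is one `(n, d_in) @ (d_in, r)` product, while the tensor that no longer needs to be stored has shape `(n, r)` per LoRA module. The finite-difference values `fd_A` and `fd_B` can be compared against `grad_A[3, 1]` and `grad_B[1, 5]` as a sanity check on the derivation.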

Paper Structure (46 sections, 14 equations, 2 figures, 11 tables)

Introduction
Related Work
Memory-Efficient Training.
Parameter-Efficient Fine-Tuning.
On-Device Training.
Background
Low-Rank Adaptation (LoRA)
Zeroth-Order Optimization (MeZO)
Memory-Efficient Backpropagation (MeBP)
Method
Core Idea: Trading Computation for Memory
Explicit Gradient Computation for LoRA
Layer-by-Layer Processing
Forward Phase.
Backward Phase.
...and 31 more sections

Figures (2)

Figure 1: Visualization of LoRA fine-tuning with Memory-efficient Structured Backpropagation (MeSP) compared to Memory-efficient Backpropagation (MeBP). (A) In MeSP, the intermediate LoRA projection $h = xA$ is not cached in the forward pass and is recomputed in the backward pass, then discarded immediately after being used to compute the gradients of $A$ and $B$. This keeps only a few essential activations in memory at any time, substantially reducing peak memory usage. (B) In MeBP, the intermediate projection $h$ is kept as a forward activation for each LoRA module and later loaded again during backpropagation to compute the gradients. Because these $h$ tensors remain in memory, this leads to higher peak memory usage than MeSP.
Figure 2: Training loss on Qwen2.5-0.5B. MeBP and MeSP achieve identical convergence (loss $\sim$3.33). MeZO converges to $\sim$3.45 (3.6% higher).
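The gap between MeZO and the exact-gradient methods in Figure 2 can be illustrated with a toy experiment on how well a two-point zeroth-order estimate aligns with the true gradient. The sketch below is not the paper's measurement: it assumes a linear toy loss whose gradient is known exactly, Gaussian perturbations, and a hypothetical dimension `d = 10_000`. For a single random direction in $d$ dimensions, the cosine similarity with the true gradient is on the order of $1/\sqrt{d}$, so it shrinks as the number of trainable parameters grows.

```python
import numpy as np

def spsa_estimate(loss_fn, theta, z, eps=1e-3):
    """Two-point estimate: ((L(theta + eps z) - L(theta - eps z)) / (2 eps)) * z."""
    slope = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return slope * z

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 10_000                                  # hypothetical parameter count
true_grad = rng.standard_normal(d)

def toy_loss(theta):
    # Linear toy loss, so its gradient equals true_grad everywhere.
    return float(true_grad @ theta)

theta = rng.standard_normal(d)

# Cosine similarity of 200 independent single-direction estimates.
cosines = [
    cosine(spsa_estimate(toy_loss, theta, rng.standard_normal(d)), true_grad)
    for _ in range(200)
]
mean_cos = float(np.mean(cosines))
```

Here `mean_cos` stays far below 1 even though every estimate points in a descent-compatible direction, which is the qualitative behavior the abstract attributes to MeZO. The exact value depends on `d`, the seed, and the loss, so it should not be read as reproducing the reported figure.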
