LANCE: Low Rank Activation Compression for Efficient On-Device Continual Learning

Marco Paul E. Apolinario, Kaushik Roy

Abstract

On-device learning is essential for personalization, privacy, and long-term adaptation in resource-constrained environments. Achieving this requires efficient learning, both when fine-tuning existing models and when continually acquiring new tasks without catastrophic forgetting. Yet both settings are constrained by the high memory cost of storing activations during backpropagation. Existing activation compression methods reduce this cost but rely on repeated low-rank decompositions, which introduce computational overhead. Moreover, such methods have not been explored for continual learning. We propose LANCE (Low-rank Activation Compression), a framework that performs a one-shot higher-order singular value decomposition (HOSVD) to obtain a reusable low-rank subspace for activation projection. This eliminates repeated decompositions, reducing both memory and computation. The fixed low-rank subspaces also enable on-device continual learning by allocating tasks to orthogonal subspaces without storing large task-specific matrices. Experiments show that LANCE reduces activation storage by up to 250$\times$ while maintaining accuracy comparable to full backpropagation on the CIFAR-10/100, Oxford-IIIT Pets, Flowers102, and CUB-200 datasets. On continual learning benchmarks (Split CIFAR-100, Split MiniImageNet, 5-Datasets), it performs competitively with orthogonal gradient projection methods at a fraction of the memory cost. These results position LANCE as a practical and scalable solution for efficient fine-tuning and continual learning on edge devices.
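The one-shot idea in the abstract can be illustrated with a minimal NumPy sketch: per-mode bases are computed once from a calibration tensor, then reused to compress later activations into a small core tensor. This is an illustration under stated assumptions, not the authors' implementation. The helper names (`mode_product`, `calibrate_factors`, `compress`, `reconstruct`), the choice to compress every mode, the energy-based rank rule, and the synthetic low-rank tensors are all hypothetical.

```python
import numpy as np

def mode_product(x, mat, mode):
    """Multiply matrix `mat` (J x I_mode) into axis `mode` of tensor x."""
    moved = np.moveaxis(x, mode, 0)
    out = np.tensordot(mat, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def calibrate_factors(calib, energy):
    """One-shot step: truncated left singular bases of each mode unfolding.

    Assumption: the rank of each mode is the smallest one whose singular
    values retain at least `energy` of the squared spectrum.
    """
    factors = []
    for mode in range(calib.ndim):
        unfolded = np.moveaxis(calib, mode, 0).reshape(calib.shape[mode], -1)
        u, s, _ = np.linalg.svd(unfolded, full_matrices=False)
        cum = np.cumsum(s ** 2) / np.sum(s ** 2)
        rank = min(int(np.searchsorted(cum, energy)) + 1, s.size)
        factors.append(u[:, :rank])
    return factors

def compress(x, factors):
    """Project x onto the fixed bases; only the resulting core is stored."""
    for mode, u in enumerate(factors):
        x = mode_product(x, u.T, mode)
    return x

def reconstruct(core, factors):
    """Map a core tensor back to the original shape with the same bases."""
    for mode, u in enumerate(factors):
        core = mode_product(core, u, mode)
    return core

# Synthetic stand-in for activations: tensors sharing one low-rank subspace.
rng = np.random.default_rng(0)
true_bases = [rng.standard_normal((n, r)) for n, r in [(8, 3), (10, 4), (6, 2)]]

def sample_activation():
    return reconstruct(rng.standard_normal((3, 4, 2)), true_bases)

factors = calibrate_factors(sample_activation(), energy=1 - 1e-9)  # done once
x = sample_activation()                                            # later batch
core = compress(x, factors)                                        # no new SVD
rel_err = np.linalg.norm(reconstruct(core, factors) - x) / np.linalg.norm(x)
print(core.shape, x.shape, rel_err)
```

Because the toy tensors lie exactly in a shared subspace, the reused bases reconstruct the later batch almost perfectly. Real activations are only approximately low-rank, so a lower energy threshold trades reconstruction error for memory.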

Paper Structure

This paper contains 41 sections, 6 theorems, 16 equations, 6 figures, 9 tables, 1 algorithm.

Key Result

theorem 1

Under Assumptions assump:orth and assump:linear, for any layer $l$ with input projector $\mathsf{P}^{(l)}$, the LANCE weight gradient equals the right-Frobenius orthogonal projection of the full gradient: $\nabla_{{\bm{W}}}\mathcal{L}_{\mathrm{LANCE}} = \nabla_{{\bm{W}}}\mathcal{L}_{\mathrm{full}}\,
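The projection property can be checked numerically on a toy case. The sketch below assumes a single linear layer $Y = XW^\top$, an input projector of the form $\mathsf{P} = UU^\top$ with orthonormal $U$, and reads "right projection" as right-multiplication of the full gradient by $\mathsf{P}$. The statement above is truncated, so this reading is an assumption rather than the paper's exact formula, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, rank = 32, 16, 8, 4

x = rng.standard_normal((batch, d_in))        # layer input (activation)
grad_y = rng.standard_normal((batch, d_out))  # upstream gradient dL/dY

# Assumed projector: orthonormal basis U of a rank-4 input subspace.
u, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))
proj = u @ u.T

# Full backpropagation stores x and computes dL/dW = dY^T X.
grad_full = grad_y.T @ x

# Compressed variant stores only the (batch x rank) coefficients X U
# and maps back through U^T during the backward pass.
x_core = x @ u
grad_lance = (grad_y.T @ x_core) @ u.T

# The compressed gradient equals the full gradient times the projector.
is_projection = np.allclose(grad_lance, grad_full @ proj)

# Its Frobenius inner product with the full gradient equals its own
# squared norm, so it is never an ascent direction.
inner = np.sum(grad_lance * grad_full)
is_descent = inner >= 0 and np.isclose(inner, np.linalg.norm(grad_lance) ** 2)
print(is_projection, is_descent)
```

In this toy setting the stored tensor shrinks from `batch x d_in` to `batch x rank` entries while the gradient keeps a non-negative inner product with the full gradient.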

Figures (6)

  • Figure 1: Overview of LANCE for on-device training. (a) In full backpropagation (BP), the entire activation tensor ${\bm{\mathsfit{X}}}^{(l)}$ must be stored in memory for computing gradients, leading to a large memory footprint. (b) LANCE replaces ${\bm{\mathsfit{X}}}^{(l)}$ with a compressed core tensor ${\bm{\mathsfit{G}}}^{(l)}$, obtained via one-shot HOSVD using fixed low-rank matrices $\{{\bm{U}}_i^{(l)}\}_{i=1}^d$ computed once at the beginning of training. Only ${\bm{\mathsfit{G}}}^{(l)}$ is stored, while the factors are reused during the backward pass. (c) Pareto comparisons show that LANCE reduces memory (SRAM) usage by up to $\sim$250$\times$ and FLOPs by $\sim$1.5$\times$ relative to vanilla BP, while maintaining accuracy.
  • Figure 2: Gradient alignment between full BP and LANCE. We plot the angle between true gradients and LANCE-projected gradients across epochs for different fine-tuning tasks. LANCE consistently produces gradients within 70$^\circ$ of the true gradient, and angles stabilize as training progresses, indicating preserved descent directions.
  • Figure 3: Ablation studies of LANCE. (Left) Effect of the energy threshold $\varepsilon$ on CUB-200 using MCUNet and ResNet34. Accuracy improves steadily as $\varepsilon$ increases, but memory usage grows exponentially, illustrating the trade-off between accuracy and compression. (Right) Effect of the number of calibration batches $N$ on CIFAR-100. Accuracy remains stable even for very small $N$, while memory increases with larger calibration sets. These results show that stable subspaces can be obtained with as few as $N{=}2$ calibration batches, and practical choices of $\varepsilon$ (e.g., 0.7) provide a good balance between accuracy and memory.
  • Figure 4: Gradient fidelity across compression ratios. We vary the energy threshold $\varepsilon$ controlling the effective memory compression of LANCE and report the angle (mean $\pm$ std) between full-BP gradients and LANCE-projected gradients over the final 10 epochs (out of 50). As $\varepsilon$ decreases, compression becomes more aggressive and gradient alignment degrades accordingly. Nevertheless, even under memory reductions of up to two orders of magnitude, LANCE maintains gradient directions within $\sim 70^\circ$ of full BP, indicating preserved descent directions despite strong compression.
  • Figure 5: End-to-end fine-tuning latency on a Raspberry Pi 3B+ (MCUNet on CIFAR-10, batch size 128). Averaged over five trials, LANCE consistently achieves the lowest forward, backward, and total training times across all layer depths.
  • ...and 1 more figure
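The gradient-alignment metric reported in Figures 2 and 4 is an angle between two gradients. A minimal sketch of how such an angle could be computed is given below. It assumes gradients are flattened and compared by cosine similarity, and it uses nested random subspaces as a stand-in for different compression levels. The function name `gradient_angle_deg` and the synthetic data are hypothetical, not taken from the paper.

```python
import numpy as np

def gradient_angle_deg(g_ref, g_approx):
    """Angle in degrees between two gradients, treated as flat vectors."""
    a = np.ravel(g_ref)
    b = np.ravel(g_approx)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(0)
g_full = rng.standard_normal((8, 16))  # stand-in for a full-BP weight gradient

# Nested subspaces: the first `rank` columns of one orthogonal matrix.
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))

angles = {}
for rank in (2, 8, 16):
    u = q[:, :rank]
    g_proj = g_full @ u @ u.T  # gradient projected onto the subspace
    angles[rank] = gradient_angle_deg(g_full, g_proj)

for rank, angle in angles.items():
    print(f"rank {rank:2d}: {angle:.2f} degrees")
```

For an orthogonal projection the cosine equals the ratio of the projected norm to the full norm, so the angle stays below 90 degrees whenever the projected gradient is nonzero and shrinks as the retained subspace grows.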

Theorems & Definitions (10)

  • theorem 1: Projected gradient & descent
  • theorem 2: Monotone decrease & projected stationarity
  • proposition 1: Stationarity gap vs. truncation
  • theorem 1: Projected gradient & descent
  • proof
  • theorem 2: Monotone decrease & projected stationarity
  • proof
  • proposition 1: Stationarity gap vs. truncation
  • proof
  • remark 1: Why the leakage bound is reasonable