Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning

Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

TL;DR

This work tackles catastrophic forgetting in continual learning for large pre-trained models by rethinking LoRA-based mixture-of-experts. It introduces MoRA, which decomposes each rank-$r$ update into $r$ rank-one experts and uses self-activated sparse gating to select a small, input-dependent subset of ranks, guided by an activation budget $k$ and temperature $\tau_{MoRA}$. Interpreting weight updates as linear associative memories with key–value slots, MoRA enables fine-grained reuse of shared knowledge while preventing interference and excessive resource use. Empirically, MoRA achieves strong performance on CLIP and language model continual learning benchmarks, with reduced forgetting and improved generalization, using far fewer active parameters than traditional MoE methods and approaching multi-task learning on long sequences.
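
The rank-one decomposition mentioned above follows from a standard identity for low-rank products. As a sketch, assume the LoRA update is written $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, where $b_i$ is the $i$-th column of $B$ and $a_i^\top$ the $i$-th row of $A$. The gate $g_i(x)$ is a placeholder symbol introduced here for illustration; its exact form is not given in this summary.

$$
\Delta W = BA = \sum_{i=1}^{r} b_i a_i^\top,
\qquad
\Delta W x = \sum_{i=1}^{r} (a_i^\top x)\, b_i,
\qquad
\Delta W_{\text{sparse}}(x)\, x = \sum_{i=1}^{r} g_i(x)\,(a_i^\top x)\, b_i,
\quad \#\{i : g_i(x) \neq 0\} \le k.
$$

Read as an associative memory, each $a_i$ acts as a key matched against the input and each $b_i$ as the value written to the output, so gating individual terms selects individual key–value slots rather than whole adapters.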

Abstract

Continual learning (CL) with large pre-trained models (PTMs) is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters. However, their coarse adapter-level selection introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, routing for earlier tasks degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL. Rather than mixing multiple low-rank matrices, MoRA decomposes each rank-$r$ update into $r$ rank-one components, each treated as an independent expert, enabling fine-grained rank-one expert utilization while mitigating interference and redundancy. To avoid ambiguous routing, we propose that each rank-one expert infer its own relevance via intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning benchmarks using CLIP and language models, analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA is effective at enhancing CL with PTMs, improving generalization while mitigating forgetting.
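
To make the idea of self-activated sparse rank selection concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `mora_delta`, the use of $|a_i^\top x|$ as the relevance score, and the temperature softmax over the top-$k$ scores are all assumptions chosen for illustration, since the abstract does not specify the exact gating rule.

```python
import math


def mora_delta(x, A, B, k=2, tau=1.0):
    """Illustrative sparse mixture of rank-one experts (assumed formulation).

    x: input vector (length d_in)
    A: list of r key vectors a_i (each of length d_in)
    B: list of r value vectors b_i (each of length d_out)
    Returns (delta_output, gates), where gates has at most k nonzero entries.
    """
    # Self-activation: each expert scores itself with its own intermediate
    # activation a_i^T x, so no separate router network is needed.
    acts = [sum(a_j * x_j for a_j, x_j in zip(a, x)) for a in A]

    # Activation budget: keep the k experts with the largest |activation|.
    order = sorted(range(len(acts)), key=lambda i: abs(acts[i]), reverse=True)
    top = order[:k]

    # Temperature softmax over the selected scores (assumption, not from the paper).
    m = max(abs(acts[i]) for i in top)
    exps = {i: math.exp((abs(acts[i]) - m) / tau) for i in top}
    z = sum(exps.values())
    gates = [exps[i] / z if i in exps else 0.0 for i in range(len(acts))]

    # Output: sum over selected ranks of g_i * (a_i^T x) * b_i.
    d_out = len(B[0])
    out = [0.0] * d_out
    for i in top:
        for j in range(d_out):
            out[j] += gates[i] * acts[i] * B[i][j]
    return out, gates
```

With three rank-one experts and `k=2`, only the two experts whose keys respond most strongly to the input contribute to the output; the third receives a gate of exactly zero, which is the sense in which unrelated ranks stay inactive.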

Paper Structure

This paper contains 24 sections, 11 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Conceptual illustration of CL with (a) LoRA, (b) MoE-LoRA, and (c) MoRA (Ours).
  • Figure 2: Overview of MoRA. For each new task, we freeze the ranks learned on previous tasks and introduce $r$ new ranks of updates. Our sparse self-activated mixture-of-ranks framework jointly considers all old and new ranks, adaptively inferring a sparse mixture weight for each rank. Panels (a,c) illustrate MoRA conceptually and (b,d) detail its computation for tasks $t$ and $t+1$, respectively.
  • Figure 3: Visualization of MoRA rank activations during Task 1 and Task 2 training. Activations are extracted from the K projection in the attention module (layer 8) of the image encoder. Corresponding image patches are shown below each activation map, with regions relevant to each class marked by orange bounding boxes. Zoom in for details. More visualizations are given in the supplementary rank-visualization figures of the Appendix, demonstrating forgetting mitigation and knowledge reuse.
  • Figure 4: Routing/activation strategy
  • Figure 4: Extended view of the main-text rank activation visualization (Figure 3), illustrating forgetting mitigation. Regions corresponding to object semantics are highlighted with orange bounding boxes. Zoom in for details.
  • ...and 4 more figures
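
The per-task growth scheme described in the Figure 2 caption (freeze ranks from earlier tasks, then add $r$ new ranks) can be sketched as a simple bookkeeping structure. The class `RankPool`, its method names, and the initialization (small Gaussian keys, zero values, in the style of common LoRA initialization) are hypothetical and chosen only for illustration; the summary does not describe the authors' data structures or initialization.

```python
import random


class RankPool:
    """Hypothetical container for rank-one experts accumulated across tasks."""

    def __init__(self, d_in, d_out, seed=0):
        self.d_in = d_in
        self.d_out = d_out
        self.rng = random.Random(seed)
        self.ranks = []  # one dict per rank-one expert

    def start_task(self, task_id, r):
        """Freeze all existing ranks, then append r new trainable ranks."""
        for rank in self.ranks:
            rank["trainable"] = False
        for _ in range(r):
            self.ranks.append({
                "task": task_id,
                # Key vector: small random init (assumption).
                "a": [self.rng.gauss(0.0, 0.02) for _ in range(self.d_in)],
                # Value vector: zero init so a new rank starts as a no-op (assumption).
                "b": [0.0] * self.d_out,
                "trainable": True,
            })

    def trainable_indices(self):
        """Indices of ranks that would receive gradient updates."""
        return [i for i, rank in enumerate(self.ranks) if rank["trainable"]]
```

After calling `start_task` for two tasks with `r=3`, the pool holds six ranks, of which only the last three are trainable. All six would remain candidates for activation at inference time, which is how old and new ranks can be considered jointly.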