Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
TL;DR
This work tackles catastrophic forgetting in continual learning for large pre-trained models by rethinking LoRA-based mixture-of-experts. It introduces MoRA, which decomposes each rank-$r$ update into $r$ rank-one experts and uses self-activated sparse gating to select a small, input-dependent subset of ranks, guided by an activation budget $k$ and temperature $\tau_{MoRA}$. Interpreting weight updates as linear associative memories with key–value slots, MoRA enables fine-grained reuse of shared knowledge while preventing interference and excessive resource use. Empirically, MoRA achieves strong performance on CLIP and LM continual learning benchmarks, with reduced forgetting and improved generalization, using far fewer active parameters than traditional MoE methods and approaching multi-task learning on long sequences.
Abstract
Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters. However, this coarse adapter-level selection introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, routing for earlier tasks degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL. Rather than mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-one components, each treated as an independent expert, enabling fine-grained use of rank-one experts while mitigating interference and redundancy. To avoid ambiguous routing, we let each rank-one expert infer its own relevance from its intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning benchmarks using CLIP and language models, analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA is significantly effective at enhancing CL with pre-trained models (PTMs), improving generalization while mitigating forgetting.
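The rank-one decomposition and sparse rank selection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the use of the absolute intermediate activation as each rank's relevance score, and the temperature-scaled softmax over the top-k ranks are all assumptions made for illustration; the abstract does not specify the exact gating rule, and rank pruning is omitted.

```python
import numpy as np

def rank_one_experts(A, B):
    """Split a rank-r update B @ A into r rank-one matrices.

    A: (r, d_in), each row acts as a "key"; B: (d_out, r), each column as a "value".
    """
    return [np.outer(B[:, i], A[i]) for i in range(A.shape[0])]

def sparse_rank_update(x, A, B, k, tau=1.0):
    """Apply only the k most relevant rank-one experts to input x.

    ASSUMPTION: relevance is |A[i] @ x| / tau, so each rank scores itself
    from its own intermediate activation and no separate router is used.
    """
    h = A @ x                                   # (r,) per-rank activations
    relevance = np.abs(h) / tau
    active = np.argsort(relevance)[-k:]         # indices of the k largest scores
    weights = np.exp(relevance[active] - relevance[active].max())
    gate = np.zeros_like(h)
    gate[active] = weights / weights.sum()      # softmax over active ranks only
    return B @ (gate * h), gate

rng = np.random.default_rng(0)
d_in, d_out, r, k = 8, 6, 4, 2                  # hypothetical sizes
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

experts = rank_one_experts(A, B)
delta, gate = sparse_rank_update(x, A, B, k)
print("active ranks:", np.flatnonzero(gate))
```

Summing all r rank-one experts recovers the full LoRA update `B @ A`, while the gated output touches only k of them per input, which is the sense in which the activation budget limits active parameters.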
