Shared LoRA Subspaces for almost Strict Continual Learning

Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille

TL;DR

Share tackles the problem of efficiently and continually adapting large pretrained models without data replay by learning a single, shared low-rank subspace. It initializes a foundational subspace from existing LoRA adapters via SVD and incrementally expands it with new information, training only lightweight coefficients and analytically updating the subspace to preserve past knowledge. Theoretical analysis provides incremental subspace error bounds, and empirical results show up to $100\times$ parameter reduction and $281\times$ memory savings while achieving performance near jointly trained baselines across vision, language, and multimodal tasks. This approach enables scalable, asynchronous continual learning and model serving by compressing hundreds of adapters into one reusable subspace, reducing resource use and broadening access to continual learning with large models.
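
The initialization and expansion steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `init_shared_subspace` and `expand_subspace`, the row-wise stacking of one LoRA factor per adapter, and the toy dimensions are all hypothetical. The sketch only shows the general idea of extracting a shared basis by SVD, keeping small per-adapter coefficients, and re-running the same procedure when a new adapter arrives.

```python
import numpy as np


def init_shared_subspace(adapters, k):
    """Stack adapter matrices row-wise and keep the top-k right singular vectors.

    adapters: list of arrays, each of shape (r_i, d), e.g. one LoRA factor each.
    Returns (basis, coeffs): basis has shape (k, d) with orthonormal rows, and
    coeffs[i] has shape (r_i, k) so that coeffs[i] @ basis approximates adapter i.
    """
    stacked = np.concatenate(adapters, axis=0)            # (sum of r_i, d)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:k]                                        # shared principal directions
    coeffs = [a @ basis.T for a in adapters]              # projection coefficients
    return basis, coeffs


def expand_subspace(basis, coeffs, new_adapter, k):
    """Hypothetical merge step: rebuild the old adapters from the subspace
    (no original data or adapters needed) and redo the SVD with the new one."""
    reconstructed = [c @ basis for c in coeffs]
    return init_shared_subspace(reconstructed + [new_adapter], k)


# Toy data (assumption): five rank-8 adapters that all live in a 4-dim subspace.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32))
adapters = [rng.standard_normal((8, 4)) @ latent for _ in range(5)]

basis, coeffs = init_shared_subspace(adapters, k=4)
max_err = max(np.abs(c @ basis - a).max() for c, a in zip(coeffs, adapters))

# A new adapter that introduces one extra direction; grow the subspace to k=5.
extra = rng.standard_normal((1, 32))
new_adapter = rng.standard_normal((8, 5)) @ np.vstack([latent, extra])
new_basis, new_coeffs = expand_subspace(basis, coeffs, new_adapter, k=5)
all_adapters = adapters + [new_adapter]
max_err_after = max(
    np.abs(c @ new_basis - a).max() for c, a in zip(new_coeffs, all_adapters)
)

print("max reconstruction error before expansion:", max_err)
print("max reconstruction error after expansion:", max_err_after)
```

In this toy setting the adapters are exactly low-rank, so reconstruction is near machine precision. With real adapters the error would depend on how much of the spectrum the retained `k` directions capture.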

Abstract

Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low-rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration that do not rely on data replay or multiple adapters. We propose Share, a novel approach to parameter-efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer while minimizing catastrophic interference. This approach achieves up to $100\times$ parameter reduction and $281\times$ memory savings over traditional LoRA methods while maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.

Paper Structure

This paper contains 38 sections, 5 theorems, 32 equations, 10 figures, 8 tables.

Key Result

Proposition 1

(Incremental Subspace Error Bound): Let $\mathcal{D}^{t} = [D_1, D_2, \ldots, D_t] \in \mathbb{R}^{N_t \times d}$ be the cumulatively stacked weight matrix up to task $\tau_t$, where $N_t = \sum_{i=1}^t n_i$. Using the Share approximation at task $t$ with $k$ principal basis vectors, we get $\hat{\mathcal{D}}^{t}$ ... where the $\sigma_i$ are the singular values of $\mathcal{D}^{t}$ corresponding to the non-principal basis vectors.
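
The statement of the bound is truncated above, so the sketch below does not reproduce it. It only checks a standard related fact (the Eckart-Young identity) that explains why the discarded singular values appear in bounds of this kind: for a rank-$k$ truncated SVD, the squared Frobenius error equals the sum of the squared singular values outside the top $k$. The random matrix `D` is a stand-in for a stacked weight matrix, and its shape and the choice `k = 5` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((60, 20))    # stand-in for a stacked weight matrix
k = 5                                # number of principal basis vectors kept

U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Rank-k reconstruction from the top-k singular triplets.
D_hat = (U[:, :k] * s[:k]) @ Vt[:k]

# Equivalent view: project the rows of D onto the top-k right singular vectors.
D_proj = D @ Vt[:k].T @ Vt[:k]

err_sq = np.linalg.norm(D - D_hat, "fro") ** 2   # squared approximation error
tail_sq = np.sum(s[k:] ** 2)                     # energy in the discarded directions

print("squared Frobenius error:", err_sq)
print("sum of discarded sigma_i^2:", tail_sq)
```

The two printed quantities agree up to floating-point error, which is why the size of the non-principal singular values controls how much is lost by keeping only $k$ directions.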

Figures (10)

  • Figure 1: Evidence of a Shared Foundational Subspace in Continual Learning. Linear CKA similarity analysis reveals a universal weight subspace (orange) emerging during sequential learning. Three independent trajectories (red, green, blue), starting from different GLUE task subsets, show monotonic convergence to this shared subspace, reaching near-perfect alignment ($>0.95$) by task $T=5$. Shaded regions show standard deviation across experiments. These results demonstrate: (1) the existence of a common foundational weight subspace that efficiently encodes cross-task knowledge, and (2) our method's ability to discover it through continual adaptation without catastrophic forgetting. This convergence reveals how low-rank adapters naturally bias models toward shared weight structures that generalize across diverse tasks.
  • Figure 2: Share. Our continual reparameterization, in which only the principal coefficients $\epsilon^t$ are trained. a. Initialization: We initialize the principal factors ($\alpha^{0}, \beta^{0}$) of our Share model using available LoRA (Hu et al., 2021) adapters ($A, B$). b. Continual Adaptation: A few top $\varphi \ll k$ factors, shown as $\alpha^{0\to 1}, \beta^{0 \to 1}$, and temporary coefficients $\epsilon^{0\to 1}$ are fine-tuned when new data is incrementally received. c. Merging & Fine-tuning: The factors $\alpha^{0}, \beta^{0}$ and temporary factors $\alpha^{0\to 1}, \beta^{0\to 1}$ are merged using the initialization procedure, and $\alpha^1, \beta^1, \epsilon^i_{\alpha,\beta} \; \forall i \in [0,1]$ are analytically recalculated. $\epsilon^{1}$ can then be further fine-tuned to boost performance.
  • Figure 3: Comparing continually finetuned Share results with individual LoRAs on different tasks for text-to-image generation
  • Figure 4: Low-rank adapters share a foundational subspace. We evaluate the Share-full model's performance against reconstruction error after finetuning on the GLUE benchmark. Compared with non-continuously trained LoRA submatrices (A and B, shown in red and blue), the results show that Share's foundational subspace efficiently approximates all LoRAs, suggesting a shared subspace. The radius of each circle represents the scaled-up standard deviation of the reconstruction error.
  • Figure 5: Progression of Factor Subspace of Share. The figure shows CKA similarity (Kornblith et al., 2019) between Share's final and intermediate factors in the Continual GLUE experiments. Share effectively incorporates new factors from incoming data, shown by increased similarity over time (circle size represents variance), while preserving and converging towards optimal principal factors.
  • ...and 5 more figures
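
Figures 1 and 5 report linear CKA similarity. For reference, a minimal sketch of the standard linear CKA formula (Kornblith et al., 2019) is given below. How the paper arranges weights or factors into the two input matrices is not stated in this summary, so the convention used here (rows are paired examples, columns are features) and the toy inputs are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between x (n, p1) and y (n, p2), with rows as paired examples."""
    # Center each feature (column) so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)


rng = np.random.default_rng(0)
x = rng.standard_normal((50, 10))

# Linear CKA is invariant to orthogonal transforms and isotropic scaling,
# so a rotated and rescaled copy of x should score (numerically) 1.
q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
cka_rotated = linear_cka(x, 2.5 * x @ q)

# An unrelated random matrix should score strictly lower.
cka_random = linear_cka(x, rng.standard_normal((50, 10)))

print("CKA with rotated, rescaled copy:", cka_rotated)
print("CKA with unrelated matrix:", cka_random)
```

The invariance to rotation and scaling is what makes the measure suitable for comparing subspaces: two bases that span the same directions score close to 1 even if their individual vectors differ.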

Theorems & Definitions (7)

  • Proposition 1
  • Theorem 2
  • Proposition 1
  • Definition 1: Task Similarity
  • Corollary 1: Bounded Error Growth
  • Theorem 1
  • proof