
Is Parameter Collision Hindering Continual Learning in LLMs?

Shuo Yang, Kun-Peng Ning, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Yi-Bing Song, Li Yuan

TL;DR

This work identifies parameter collisions as a critical bottleneck in continual learning for LLMs, arguing that non-collision is a sufficient condition for orthogonality and matters more than orthogonality alone. It introduces Non-collision Low-Rank Adaptation (N-LoRA), which applies $\ell_1$ sparsity to task-specific updates $\Delta W_i = A_i B_i$, freezes the updates of previous tasks, and merges updates back into the base model, yielding highly sparse, non-colliding subspaces and reduced interference across tasks. Theoretical analysis shows that non-collision implies orthogonality, while empirical results on the Standard CL and Large Number of Tasks benchmarks show that N-LoRA outperforms O-LoRA and other baselines, achieves stronger orthogonality (OO) and lower collision rates, and generalizes better to unseen tasks (e.g., ~$49.54\%$ unseen-task accuracy, ~$+19.78\%$ over O-LoRA). The approach is plug-and-play with existing PEFT methods and scales to large models such as LLaMA-7B, offering a practical, scalable improvement for continual learning in LLMs.
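As a rough illustration of the $\ell_1$ sparsity term mentioned above, the sketch below adds a penalty on the entries of $\Delta W = A B$ to a task loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the weight `lam`, the helper names, and the choice to penalize the product $A B$ (rather than $A$ and $B$ separately) are all hypothetical, since this summary does not specify them.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def l1_norm(m):
    """Sum of absolute values of all entries of a matrix."""
    return sum(abs(v) for row in m for v in row)


def regularized_loss(task_loss, a, b, lam):
    """Task loss plus lam * ||A B||_1.

    `lam` is a hypothetical regularization weight; penalizing the
    product A B is an assumption made for this illustration.
    """
    return task_loss + lam * l1_norm(matmul(a, b))


# Toy rank-1 update: A is 3 x 1, B is 1 x 2, so delta_w is 3 x 2.
A = [[1.0], [0.0], [-2.0]]
B = [[0.5, 0.0]]
delta_w = matmul(A, B)
total = regularized_loss(0.25, A, B, lam=0.1)
```

Here the update has only two nonzero entries, so the penalty is small; a denser update with the same task loss would be charged more, which is the pressure toward sparse updates that the summary describes.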

Abstract

Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on enforcing orthogonality between tasks to decouple parameter interdependence across domains. In this paper, we reveal that building non-collision parameters is a more critical factor in addressing CL challenges. Our theoretical and experimental analyses demonstrate that non-collision parameters provide better task orthogonality: non-collision is a sufficient but not necessary condition for orthogonality. Furthermore, knowledge from multiple domains is preserved in non-collision parameter subspaces, making it more difficult to forget previously seen data. Leveraging this insight, we propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach that uses low collision rates to enhance CL in LLMs. Experimental results on multiple CL benchmarks indicate that N-LoRA achieves superior performance (+2.9), higher task orthogonality (4.1 times), and lower parameter collision (58.1 times) than SOTA methods.
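The "sufficient but not necessary" relationship can be checked on toy matrices. The sketch below assumes two things that this summary does not spell out: that a collision is a position where both updates are nonzero, and that orthogonality is measured by a zero Frobenius inner product. The paper's own definitions may differ, and the function names are hypothetical.

```python
def collision_rate(w1, w2):
    """Fraction of positions where both matrices are nonzero.

    Assumed definition of a collision for this illustration.
    """
    pairs = [(x, y) for r1, r2 in zip(w1, w2) for x, y in zip(r1, r2)]
    return sum(1 for x, y in pairs if x != 0 and y != 0) / len(pairs)


def frobenius_inner(w1, w2):
    """Sum of elementwise products (assumed orthogonality measure)."""
    return sum(x * y for r1, r2 in zip(w1, w2) for x, y in zip(r1, r2))


# Disjoint supports: no collisions, hence every product term is zero.
W1 = [[1.0, 0.0], [0.0, 2.0]]
W2 = [[0.0, 3.0], [4.0, 0.0]]

# Overlapping supports whose products cancel: orthogonal yet colliding.
W3 = [[1.0, 1.0], [0.0, 0.0]]
W4 = [[1.0, -1.0], [0.0, 0.0]]

no_collision = (collision_rate(W1, W2), frobenius_inner(W1, W2))
with_collision = (collision_rate(W3, W4), frobenius_inner(W3, W4))
```

Under these assumptions, the first pair shows that non-collision forces the inner product to zero, while the second pair is orthogonal despite colliding in half of its positions, so orthogonality alone does not rule out collisions.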


Paper Structure

This paper contains 28 sections, 3 theorems, 53 equations, 6 figures, 8 tables.

Key Result

Theorem 1

For two parameter matrices $\Delta W_1$ and $\Delta W_2$ of the same dimensions, non-collision is a sufficient but not necessary condition for orthogonality. Specifically,

Figures (6)

  • Figure 1: (a) Orthogonal but Parameter Collision: Tasks $\tau_1$, $\tau_2$, and $\tau_3$ are mutually orthogonal but interact within each space, resulting in parameter collision. (b) Non-collision and orthogonal: Tasks $\tau_1$, $\tau_2$, and $\tau_3$ update only along distinct, non-conflicting subspaces, preserving prior task knowledge. (c) Performance Comparison: N-LoRA (red) and O-LoRA (blue) are compared across various metrics, with N-LoRA achieving lower collision rates, improved orthogonality, and superior average accuracy.
  • Figure 2: Parameter distribution of O-LoRA and N-LoRA. Green and blue represent parameters from Task 1 and Task 2, respectively; red indicates collisions. Lower orthogonality values signify better orthogonality. (a) O-LoRA parameter distribution: Despite the orthogonality constraint, significant parameter collisions occur, resulting in an accuracy of $\bm{48.5\%}$ and an orthogonality of $\bm{14.9}$. (b) N-LoRA parameter distribution: N-LoRA significantly reduces parameter collisions, achieving better performance with an accuracy of $\bm{58.2\%}$ and an orthogonality of $\bm{2.8}$.
  • Figure 3: Nuclear norms of O-LoRA and N-LoRA. The top plot (blue) shows the nuclear norms for O-LoRA, indicating the subspace dimensionality used by each layer. The bottom plot (red) shows the nuclear norms for N-LoRA. The dashed lines in both plots represent the average nuclear norm across all layers.
  • Figure 4: Performance comparison on unseen tasks on Average Accuracy metric. N-LoRA consistently outperforms O-LoRA across all task order settings. The pre-trained model performs poorly, with accuracy close to random on unseen tasks.
  • Figure 5: The relationship between sparsity, collision rate, and forgetting. The x-axis and y-axis represent the Generalized Sparsity and Average Collision Rate, respectively. Data points for N-LoRA (red) are clustered in the lower-left corner, while O-LoRA (blue) data points are concentrated in the upper-right corner. The dashed ellipses indicate the average forgetting rates for each method.
  • ...and 1 more figure

Theorems & Definitions (13)

  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • proof
  • proof
  • ...and 3 more