Subspace Geometry Governs Catastrophic Forgetting in Low-Rank Adaptation

Brady Steele

TL;DR

Rank affects forgetting only when task subspaces are similar (low angle), while orthogonal methods like O-LoRA provide minimal benefit when natural orthogonality is already high. This reconciles seemingly contradictory findings in the literature.

Abstract

Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for adapting large pre-trained models, yet its behavior under continual learning remains poorly understood. We present a geometric theory characterizing catastrophic forgetting in LoRA through the lens of gradient subspace interactions. Our central finding is that forgetting is governed by a simple geometric law: $\mathcal{F} = \alpha(1 - \cos^2\theta_{\min}) + \beta$, where $\theta_{\min}$ is the minimum principal angle between task gradient subspaces. This formulation reveals an approximate rank-invariance property: at high subspace angles, forgetting becomes largely independent of the adapter rank (coefficient of variation, CV, $\approx 0.8\%$ in controlled synthetic settings; CV $\approx 10$ to $19\%$ on real benchmarks, suggesting that the property is regime-dependent rather than absolute). We validate our theory on synthetic tasks ($r=0.994$ correlation), Split-CIFAR100 with ViT-LoRA, and sequential GLUE with RoBERTa-LoRA. Our analysis reconciles seemingly contradictory findings in the literature: we show that rank affects forgetting only when task subspaces are similar (low angle), while orthogonal methods like O-LoRA provide minimal benefit when natural orthogonality is already high. These insights provide principled guidance for continual learning with parameter-efficient fine-tuning.
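As an illustration of the quantities in the geometric law, the following is a minimal sketch of how the minimum principal angle between two subspaces and the resulting term $(1 - \cos^2\theta_{\min})$ could be computed with NumPy. It is not the paper's implementation: the function names, the toy subspaces, and the values of `alpha` and `beta` are assumptions chosen for illustration, and each input matrix is assumed to have full column rank.

```python
import numpy as np

def min_principal_angle(A, B):
    """Smallest principal angle (radians) between the column spans of A and B.

    Assumes A and B have full column rank (hypothetical helper, not from the paper).
    """
    # Orthonormal bases for each subspace via reduced QR.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # The largest cosine corresponds to the minimum principal angle.
    return float(np.arccos(np.clip(cosines.max(), -1.0, 1.0)))

def geometric_forgetting(theta_min, alpha, beta):
    """Evaluate F = alpha * (1 - cos^2(theta_min)) + beta."""
    return float(alpha * (1.0 - np.cos(theta_min) ** 2) + beta)

# Toy subspaces in R^4: span{e1, e2} versus span{e3, e4} (orthogonal).
eye = np.eye(4)
A = eye[:, :2]
B = eye[:, 2:]

theta_orth = min_principal_angle(A, B)   # orthogonal subspaces
theta_same = min_principal_angle(A, A)   # identical subspaces

# alpha and beta below are illustrative values, not fitted coefficients.
f_orth = geometric_forgetting(theta_orth, alpha=2.0, beta=0.1)
f_same = geometric_forgetting(theta_same, alpha=2.0, beta=0.1)
```

For higher-dimensional or ill-conditioned inputs, `scipy.linalg.subspace_angles` is a more numerically careful alternative to the QR-plus-SVD route used here.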

Paper Structure (45 sections, 3 theorems, 16 equations, 5 figures, 6 tables)


Introduction
Key Contributions.
Clarification of novelty.
Related Work
Parameter-Efficient Fine-Tuning.
Continual Learning and Forgetting.
Gradient Subspace Methods.
LoRA and Forgetting.
Theoretical Framework
Problem Setup
Geometric Forgetting Bound
Rank-Invariance Corollary
Rank-Angle Interaction
Experiments
Experimental Setup
...and 30 more sections

Key Result

Theorem 1

Under the theorem's assumptions, the forgetting on task $i$ after training on task $t > i$ satisfies, in the form stated in the abstract, $\mathcal{F} = \alpha\,(1 - \cos^2\theta_{\min}(i,t)) + \beta$, where $\theta_{\min}(i,t)$ is the minimum principal angle between $\mathcal{G}_i$ and $\mathcal{G}_t$, $\alpha = \eta L \|\Delta_t\|^2 / \mu$ is a scaling factor depending on learning rate $\eta$, smoothness $L$, update norm $\|\Delta_t\|$, and curvature $\mu$, and $\beta \geq 0$ is baseline for
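To make the scaling factor concrete, here is a small worked sketch of $\alpha = \eta L \|\Delta_t\|^2 / \mu$. All numeric values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the paper's experiments.

```python
def scaling_factor(eta, smoothness, update_norm, curvature):
    """alpha = eta * L * ||Delta_t||^2 / mu, as defined in Theorem 1."""
    if curvature <= 0:
        raise ValueError("curvature mu must be positive")
    return eta * smoothness * update_norm ** 2 / curvature

# Hypothetical values: eta = 0.01, L = 10, ||Delta_t|| = 2, mu = 0.5.
alpha = scaling_factor(eta=0.01, smoothness=10.0, update_norm=2.0, curvature=0.5)
# 0.01 * 10 * 4 / 0.5 = 0.8
```

Because the update norm enters quadratically, doubling $\|\Delta_t\|$ quadruples $\alpha$, while doubling the learning rate only doubles it.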

Figures (5)

Figure 1: Conceptual illustration of the geometric forgetting theory. (a) Gradient subspaces for two sequential tasks with principal angle $\theta_{\min}$ between them. (b) The geometric forgetting law: the separation term $(1-\cos^2\theta_{\min})$ increases with principal angle, and empirically correlates with observed forgetting.
Figure 2: Validation of the geometric forgetting law. The interference term $(1-\cos^2\theta_{\min})$ strongly predicts forgetting on both synthetic tasks (circles, $r=0.994$) and CIFAR-100 (squares). The fitted line $\mathcal{F} = 1.93(1-\cos^2\theta) - 0.07$ achieves $R^2 = 0.987$.
Figure 3: Approximate rank-invariance validation. (a) Synthetic experiments show nearly identical forgetting across ranks 1--32 (CV = 0.84%). (b) Real benchmarks (CIFAR-100 and GLUE) show approximate rank-invariance with CV $<20\%$, consistent with our theoretical prediction in high-angle regimes.
Figure 4: Layer-wise analysis of interference-forgetting correlation. Six out of seven LoRA layers (blue) show positive correlation between $(1-\cos^2\theta)$ and forgetting, supporting local validity of the geometric theory. The aggregate correlation (dashed line) is $r=0.525$.
Figure 5: Comparison of orthogonal methods. Vanilla LoRA and O-LoRA achieve nearly identical forgetting ($p=0.73$, not significant) when natural task orthogonality is already high. Error bars show standard deviation across seeds.
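The fitted line reported in Figure 2 and the coefficient of variation reported in Figure 3 can be reproduced in form with a few lines of Python. This is a sketch only: the slope and intercept come from the Figure 2 caption, but the per-rank forgetting values are made up for illustration, and the use of the population standard deviation in the CV is an assumption, since the captions do not say which estimator was used.

```python
import math
import statistics

def fitted_forgetting(theta_deg, slope=1.93, intercept=-0.07):
    """Evaluate the Figure 2 fit F = 1.93 * (1 - cos^2(theta)) - 0.07."""
    x = 1.0 - math.cos(math.radians(theta_deg)) ** 2
    return slope * x + intercept

def coefficient_of_variation(values):
    """CV in percent: population std / mean * 100 (estimator is an assumption)."""
    return 100.0 * statistics.pstdev(values) / statistics.mean(values)

# Predicted forgetting from the fitted line at a few principal angles (degrees).
predictions = {angle: fitted_forgetting(angle) for angle in (0, 30, 60, 90)}

# Hypothetical forgetting measured at four adapter ranks (not data from the paper).
forgetting_by_rank = [0.50, 0.51, 0.49, 0.50]
cv_percent = coefficient_of_variation(forgetting_by_rank)
```

A small CV across ranks, as in this toy list, is what the rank-invariance claim looks like numerically; a CV in the 10 to 19% range, as reported for the real benchmarks, indicates a noticeably weaker form of invariance.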

Theorems & Definitions (7)

Definition 1: Gradient Subspace
Definition 2: Principal Angles
Theorem 1: Geometric Forgetting Bound (Empirically Parameterized)
Proof: Proof Sketch
Corollary 2: Rank Invariance at High Angles
Proposition 3: Extended Rank-Angle Theory
Proof
