PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

Romain Cosentino

TL;DR

PLATE addresses catastrophic forgetting in data-free continual learning of pretrained foundation models by exploiting geometric redundancy to protect old-task behavior while concentrating plasticity on redundant channels. It parameterizes updates as $\Delta W = B A Q^\top$, with $B$ selecting redundant output channels and $Q$ spanning a weight-derived low-energy input subspace, both computed from frozen weights. The approach yields a tunable retention-plasticity trade-off through two knobs, $r$ (number of trainable output channels) and $\tau$ (input-energy threshold), and demonstrates competitive or superior retention compared with LoRA across language, vision, and synthetic benchmarks, including out-of-distribution LLM specialization. This data-free method enables scalable, geometry-aware continual adaptation of large models by offering explicit control over forgetting and task performance, with practical implications for foundation-model deployment where old-task data are unavailable.

Abstract

We develop a continual learning method for pretrained models that \emph{requires no access to old-task data}, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emph{geometric redundancy}, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emph{where} to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to \textsc{PLATE} (\textbf{Pla}sticity-\textbf{T}unable \textbf{E}fficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $\Delta W = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
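
To make the parameterization concrete, the following NumPy sketch builds one possible $B$ and $Q$ from a frozen weight matrix and forms $\Delta W = B A Q^\top$. The abstract does not spell out how $B$ and $Q$ are constructed, so both recipes here are assumptions chosen for illustration: `redundant_selector` (hypothetical name) scores each output row by its largest cosine similarity to another row, and `low_energy_basis` (hypothetical name) sets aside the top right-singular directions holding a fraction `tau` of the spectral energy and returns the remaining ones. The paper's actual constructions may differ.

```python
import numpy as np

def redundant_selector(W, r):
    """Assumed heuristic: pick the r output rows of W that are most similar
    (absolute cosine) to some other row. Returns a one-hot B of shape (d_out, r)."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    sim = np.abs(Wn @ Wn.T)
    np.fill_diagonal(sim, 0.0)            # ignore self-similarity
    idx = np.argsort(-sim.max(axis=1))[:r]
    B = np.zeros((W.shape[0], r))
    B[idx, np.arange(r)] = 1.0
    return B

def low_energy_basis(W, tau):
    """Assumed heuristic: set aside the top right-singular directions of W that
    hold a fraction tau of the squared singular values; return an orthonormal
    basis Q (d_in x k) of the remaining input directions."""
    _, s, Vt = np.linalg.svd(W, full_matrices=True)
    d_in = W.shape[1]
    energy = np.zeros(d_in)
    energy[: s.size] = s ** 2
    cum = np.cumsum(energy) / energy.sum()
    n_protected = int(np.searchsorted(cum, tau)) + 1
    n_protected = min(n_protected, d_in - 1)   # keep at least one trainable direction
    return Vt[n_protected:].T

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))                    # toy frozen weight (d_out x d_in)
B = redundant_selector(W, r=2)                 # frozen
Q = low_energy_basis(W, tau=0.9)               # frozen
A = rng.normal(size=(B.shape[1], Q.shape[1]))  # the only trainable block (r x k)
delta_W = B @ A @ Q.T                          # structured update, same shape as W
```

Under these assumptions the update touches at most $r$ output rows and vanishes on the set-aside high-energy input directions, and raising `tau` can only shrink $k$, the number of columns of $Q$.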

Paper Structure (36 sections, 6 theorems, 42 equations, 10 figures, 2 tables, 2 algorithms)

Introduction
Data-Free Constraints for Continual Learning
Layerwise exact invariance orthogonality
Approximate orthogonality implies a forgetting floor
Data-free protected subspaces from neuron redundancy
Low-Curvature Update Families
A local quadratic view of worst-case forgetting
From curvature to functional drift
Designing low-drift update families without access to old-task data
PLATE: Plasticity-Tunable Efficient Adapters
Algorithm overview
Constructing the redundant-neuron selector $B$
Constructing the weight-derived input basis $Q$
Experiments
General protocol
...and 21 more sections

Key Result

Proposition 1

If $\Delta\theta$ satisfies the layerwise, per-sample orthogonality condition on $P_0$ (Definition 1), then $\mathcal{F}_0(\theta_0,\theta_0+\Delta\theta)=0$.
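
A quick numerical check can illustrate Proposition 1. The orthogonality condition itself is not reproduced in this summary, so the sketch below assumes it means that each layer's weight update annihilates that layer's inputs on old-task samples, i.e. $\Delta W_\ell\, h_{\ell-1}(x) = 0$ for every old-task $x$. The two-layer ReLU network, the random data, and the helper `null_projector` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 5, 10, 16, 3
X0 = rng.normal(size=(n, d_in))        # toy old-task inputs, one per row
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))

def forward(W1, W2, X):
    """Two-layer ReLU network; returns hidden activations and outputs."""
    H = np.maximum(X @ W1.T, 0.0)
    return H, H @ W2.T

def null_projector(V):
    """Projector onto the orthogonal complement of the row space of V."""
    return np.eye(V.shape[1]) - np.linalg.pinv(V) @ V

H0, Y0 = forward(W1, W2, X0)

# Updates that are orthogonal to each layer's inputs on the old samples.
dW1 = rng.normal(size=W1.shape) @ null_projector(X0)
dW2 = rng.normal(size=W2.shape) @ null_projector(H0)
_, Y1 = forward(W1 + dW1, W2 + dW2, X0)

# The same update is free to change behavior on fresh inputs.
X_new = rng.normal(size=(n, d_in))
_, Yn0 = forward(W1, W2, X_new)
_, Yn1 = forward(W1 + dW1, W2 + dW2, X_new)

old_change = np.abs(Y1 - Y0).max()     # numerically zero
new_change = np.abs(Yn1 - Yn0).max()   # clearly nonzero
```

Because the outputs on every old sample are unchanged, any loss averaged over those samples is unchanged too, which is the sense in which the forgetting $\mathcal{F}_0$ vanishes in this toy setting.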

Figures (10)

Figure 1: Local-geometry view of forgetting on a $2$-dimensional continual-learning binary classification problem: Blue points denote the old-task dataset $P_0$ and yellow points the new-task dataset $P_1$; decision boundaries are shown after training on $P_0$ (blue curve) and after subsequent training on $P_1$ (yellow curve). The background heatmap visualizes how training on $P_1$ changes the model's local input-output linearization, $\Delta(x)\coloneqq\|J_x(\theta_1,x)-J_x(\theta_0,x)\|_F$. Retention is compromised when the heatmap turns yellow around the blue points (large drift on $\mathrm{supp}(P_0)$), while effective learning requires yellow regions around the yellow points (large drift concentrated near $\mathrm{supp}(P_1)$). This motivates our goal: parameter-efficient continual updates that localize drift away from the (often unavailable) old distribution while remaining expressive on the new task. (Left) Full fine-tuning induces large changes throughout, including on $\mathrm{supp}(P_0)$, and the old boundary drifts. (Middle) LoRA restricts the parameter update but still produces substantial change on $\mathrm{supp}(P_0)$. (Right) PLATE updates keep $\Delta(x)$ small on $\mathrm{supp}(P_0)$ while permitting large changes near $P_1$, concentrating plasticity where it is needed and preserving old behavior (see Figure 3 for a PLATE hyperparameter sweep).
Figure 2: Restricted-curvature forgetting: We train an MLP on MNIST digits 0-4 to obtain parameters $\theta_0$. For each method, we perturb the trained model to $\theta_0 + \rho v$ and measure the resulting forgetting $\mathcal{F}_0(\theta_0, \theta_0 + \rho v) = L_0(\theta_0 + \rho v) - L_0(\theta_0)$, where $v$ is the unit vector in that method's update subspace that maximizes $v^\top H_0 v$, i.e., the highest-curvature direction. PLATE exhibits the smallest slope, indicating substantially reduced restricted curvature and correspondingly smaller worst-case forgetting.
Figure 3: PLATE exhibits a controllable forgetting-plasticity spectrum via $(r,\tau)$: We sweep PLATE’s hyperparameters on a two-moons continual-learning toy problem: the number of adapted (redundant) output neurons $r$ (rows) and the input-energy threshold $\tau$ (columns), where larger $\tau$ enforces a stricter input-side constraint (smaller $k$). Each panel overlays Dataset 1 (blue) and Dataset 2 (yellow), and visualizes how adaptation changes the model’s local input-output geometry using the Jacobian-drift heatmap $\Delta(x)=\|J_x(\theta_1,x)-J_x(\theta_0,x)\|_F$. We report forgetting on Dataset 1 and learning accuracy on Dataset 2. Increasing $r$ expands the plasticity budget and improves Dataset 2 performance but can increase Dataset 1 drift/forgetting, while increasing $\tau$ tends to concentrate updates onto more redundant degrees of freedom and reduces drift/forgetting. Overall, PLATE provides an explicit mechanism to target a desired point on the retention-adaptation trade-off.
Figure 4: Qwen2.5-7B on DeepSeek-R1 reasoning: (Left) Learning capabilities on the math/reasoning dataset. (Right) Forgetting on the instruction-following dataset. PLATE (green) matches LoRA (blue) on math/reasoning benchmarks while preserving instruction-following (IFEval), whereas LoRA exhibits substantial OOD forgetting relative to the base model.
Figure 5: OLMo-2-7B on Tulu-3: (Left) IFEval accuracy versus percentage of trainable parameters. The red dashed line marks the base model's IFEval accuracy. (Right) MATH forgetting (drop from base) versus percentage of trainable parameters. PLATE (green) improves IFEval roughly linearly with parameter budget while keeping forgetting almost flat, whereas LoRA (blue) quickly saturates on IFEval and accumulates much larger MATH forgetting.
...and 5 more figures
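
The Jacobian-drift quantity $\Delta(x)$ shown in the heatmaps of Figures 1 and 3 can be computed in closed form for a small ReLU network. The sketch below is illustrative only: the two-layer architecture $f(x) = W_2\,\mathrm{relu}(W_1 x)$, the toy weights, and the helper names `input_jacobian` and `jacobian_drift` are assumptions, not the models or code used in the paper.

```python
import numpy as np

def input_jacobian(W1, W2, x):
    """Jacobian of f(x) = W2 @ relu(W1 @ x) with respect to the input x."""
    active = (W1 @ x > 0).astype(float)      # ReLU gates at this input
    return W2 @ (active[:, None] * W1)

def jacobian_drift(theta0, theta1, x):
    """Delta(x) = || J_x(theta1, x) - J_x(theta0, x) ||_F."""
    J0 = input_jacobian(*theta0, x)
    J1 = input_jacobian(*theta1, x)
    return np.linalg.norm(J1 - J0)           # Frobenius norm for 2-D arrays

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 2))
W1 = np.vstack([M, -M])                      # mirrored rows: some units are active at any x
W2 = rng.normal(size=(1, 8))

theta0 = (W1, W2)                                      # toy "before" parameters
theta1 = (W1, W2 + 0.1 * rng.normal(size=W2.shape))    # toy "after" parameters

x = np.array([0.3, -0.7])
drift_same = jacobian_drift(theta0, theta0, x)         # identical models: zero drift
drift_moved = jacobian_drift(theta0, theta1, x)        # perturbed model: positive drift
```

Evaluating `jacobian_drift` on a grid of inputs would give a heatmap of the kind described in the captions.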

Theorems & Definitions (13)

Definition 1: Layerwise, per-sample orthogonality on $P_0$
Proposition 1: Per-layer orthogonality yields no forgetting (proof in the appendix)
Theorem 1: Lower bound on worst-case forgetting under approximate orthogonality (proof in the appendix)
Proposition 2: Layerwise weights lie in a data-dependent prototype subspace \cite{garrod2024persistence} (proof in the appendix)
Proposition 3: Upper bound via restricted curvature (proof in the appendix)
Proposition 4: Restricted curvature is bounded by functional drift (proof in the appendix)
Theorem 2: Worst-case forgetting is controlled by functional drift (proof in the appendix)
proof
proof
proof
...and 3 more
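
The restricted-curvature quantity behind Proposition 3 and Figure 2 has a simple linear-algebra form: the unit vector $v$ in a subspace that maximizes $v^\top H_0 v$ is the top eigenvector of $H_0$ compressed onto that subspace. The sketch below assumes $H_0$ is a symmetric curvature matrix (for instance, the Hessian of the old-task loss at $\theta_0$) and uses a random toy matrix with an exactly quadratic loss; `worst_direction` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def worst_direction(H, U):
    """Unit vector in span(U) maximizing v^T H v, plus the attained curvature."""
    Qb, _ = np.linalg.qr(U)                       # orthonormal basis of span(U)
    evals, evecs = np.linalg.eigh(Qb.T @ H @ Qb)  # eigenvalues in ascending order
    return Qb @ evecs[:, -1], evals[-1]

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
H = M @ M.T                                       # toy symmetric PSD curvature matrix
evals_H, evecs_H = np.linalg.eigh(H)

U_full = np.eye(6)                                # unrestricted update family
U_flat = evecs_H[:, :2]                           # family confined to the two flattest directions

rho = 0.1
forgetting = {}
for name, U in [("full", U_full), ("flat", U_flat)]:
    v, lam = worst_direction(H, U)
    # For the toy loss L0(theta) = 0.5 * (theta - theta0)^T H (theta - theta0),
    # stepping to theta0 + rho * v increases the loss by 0.5 * rho^2 * v^T H v.
    forgetting[name] = 0.5 * rho ** 2 * (v @ H @ v)
```

In this toy setting the update family confined to low-curvature directions has a smaller worst-case loss increase at the same step size $\rho$, which is the kind of comparison Figure 2 reports across methods.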
