Context selectivity with dynamic availability enables lifelong continual learning

Martin Barry; Wulfram Gerstner; Guillaume Bellec

TL;DR

This work introduces GateON, a simple yet powerful meta-plasticity framework for lifelong continual learning that combines gated context selectivity with a dynamic availability mechanism to regulate plasticity across tasks. It provides both a normative parametric theory (p-GateON) and a bio-plausible neuro-centric instantiation (n-GateON), unifying context gating and task-specific consolidation without replay. Empirical results across MNIST variants, CIFAR-100, and NLP benchmarks (including BERT-based settings) show strong forward transfer and reduced forgetting, outperforming several replay-free baselines and remaining effective as task counts scale. The paper also offers experimental neuroscience predictions, arguing that neuronal availability signals and metaplasticity-like dynamics could underlie lifelong learning in the brain, with practical implications for designing robust CL systems in AI. Overall, GateON presents a parsimonious, testable mechanism that balances forgetting and consolidation, enabling transfer across modalities and suggesting concrete paths for neuroscience-informed CL research.

Abstract

"You never forget how to ride a bike", but how is that possible? The brain is able to learn complex skills, stop practicing them for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms behind this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity combined with neuron-wide consolidation is a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.
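
The two principles can be illustrated with a minimal NumPy sketch. It is not the paper's p-GateON or n-GateON rule: the function names, the multiplicative gate, and in particular the toy availability update (relevance lowers availability, a rate `eps` slowly restores it) are assumptions made here for illustration only.

```python
import numpy as np

def gated_forward(W, x, gate):
    """Principle 1 (sketch): a per-neuron context gate multiplies the ReLU activity."""
    return gate * np.maximum(W @ x, 0.0)

def obstructed_step(W, grad_W, availability, lr=0.1):
    """Principle 2 (sketch): availability in [0, 1] scales each neuron's weight update."""
    return W - lr * availability[:, None] * grad_W

def update_availability(availability, relevance, eps=0.0):
    """Toy rule (assumption, not the paper's equation): relevant neurons lose
    availability, and eps slowly unfreezes them again."""
    a = availability * (1.0 - relevance) + eps * (1.0 - availability)
    return np.clip(a, 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                   # 4 neurons, 3 inputs
x = rng.normal(size=3)

gate_task1 = np.array([1.0, 1.0, 0.0, 0.0])   # hypothetical gate: neurons 0 and 1 serve task 1
h = gated_forward(W, x, gate_task1)           # neurons 2 and 3 are silenced

availability = np.ones(4)
relevance = gate_task1.copy()                 # toy choice: gated-on neurons count as fully relevant
availability = update_availability(availability, relevance)

grad = rng.normal(size=W.shape)               # stand-in for a gradient from a later task
W_new = obstructed_step(W, grad, availability)
```

In this toy run, the incoming weights of neurons 0 and 1 are left untouched by the later update because their availability dropped to zero, while neurons 2 and 3 remain fully plastic. Calling `update_availability` with `eps > 0` would let the frozen neurons recover some plasticity over time.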

Paper Structure (38 sections, 21 equations, 6 figures, 13 tables, 1 algorithm)

Results
A normative theory for continual learning
Principle 1: Gated context selectivity.
Extension of Principle 1 to unidentified context switches.
Principle 2: Gradient Obstruction - freezing and unfreezing of plasticity.
The parametric view.
Definition of the availability.
Algorithmic relevance estimation.
Bio-plausible implementation of GateON
The selectivity of cortical neurons to experimental context supports Principle 1.
A plausible neuro-centric implementation of Principle 2.
Further simplification of the relevance computation?
Framework for Simulation Results
Model comparison on established image CL problems
Ablation study with 100 MNIST tasks.
...and 23 more sections

Figures (6)

Figure 1: Neural selectivity and learning availability. Top Panel: Illustration of the GateON selectivity mechanism across two sequential tasks. Consider the scenario wherein task 1 classifies the animal on the left side of an image, while task 2 classifies entities on the right. In the neural network, the context layer silences the neurons in blue and leaves the red ones active. Connections illustrated as dashed lines represent those with diminished availability by the end of the task. Bottom Panel: A representative simulation of n-GateON spanning a sequence of three tasks using permuted MNIST. We display the mean neural activity of task-selective neurons (x, top), the gating state of task-selective neurons (g, middle), and availability (A) for 250 mini-batch presentations. The blue and yellow trajectories represent averages across those neurons with activity above some threshold during tasks 2 or 3.
Figure 2: Effect of bio-plausible simplification of the relevance measure in n-GateON. Top: The average availability $\langle A_i^l\rangle$ of neurons (evaluated at times $t_k$ for tasks $1\le k \le 10$) decreases over 10 tasks for the MNIST CL problems Permuted (left), Rotated (middle), and Shuffled MNIST (right) using n-GateON with $\epsilon=0$. The blue curve shows results with the algorithmic relevance of Eq. (\ref{eq:n-mu}), the green one with the simplified bio-plausible relevance of Eq. (\ref{eq-mu8}), and the orange one shows an intermediate bio-plausible variant for comparison (see Methods \ref{matmet:layer-wise-derivation}). Middle and Bottom: With all three relevance definitions, n-GateON reaches similar task-locked accuracy (middle) and continual accuracy (bottom), above $97\%$ after 500 training steps per task.
Figure 3: The $\epsilon$ parameter affects network saturation. The parameter $\epsilon$ can be understood as the rate for unfreezing parameters. A: Immediate test accuracy (in percent) across tasks for n-GateON on Permuted (with task identity given, not inferred). The colors refer to different values of the parameter $\epsilon$. For $\epsilon = 0$ (blue), the test accuracy drops after about 10-20 tasks, indicating that the network saturates and cannot learn new tasks. B: Simulation where the context index $k$ is inferred. We show the fraction of missed detections of task switches for n-GateON (in percent).
Figure 4: Emergence of shared network structure through transfer learning. This figure shows the Pearson correlations of context weights across all layers of n-GateON$_{CNN}$ after training on 100 tasks for Permuted (top) and Rotated (bottom). We use n-GateON$_{CNN}$ to show the effect on both fully connected and convolutional layers. As expected for Permuted, after $15$ tasks we observe a positive correlation of the context weights in the output layers, suggesting that these layers are then shared across tasks. In contrast, there is no structure in early convolutional layers due to the randomness of the permutation. For Rotated, we re-ordered the tasks by increasing angle after training. The context weights are correlated for neighboring tasks to facilitate generalization. The correlation across neighboring tasks is most striking at intermediate layers.
Figure 5: Context Detection in MNIST. A1-3: We executed 10 continual learning (CL) tasks and tracked the number of tasks identified by GateON compared to the number of tasks presented. The results are plotted for two $\Theta$ values, where $\Theta$ modulates task detection stringency as explained in Algorithm \ref{matmet:alg:euclid}. The dashed black line represents perfect task estimation accuracy. Results are averaged over 10 trials, with standard deviation indicated by the shaded region. A1 and A3 exhibit parity between the number of tasks presented and detected for both $\Theta$ parameters. In contrast, A2 reveals that for Rotated, GateON underestimates the actual task count, with larger $\Theta$ values leading to fewer tasks detected. Specifically, tasks with closely spaced angles may be misconstrued as a single task, leading to context collapse. B: To elucidate the reasons for the incomplete task detection in A2, we executed two consecutive Rotated tasks with an angle difference $\Delta \alpha \in [-25,25]$, and charted the switch-point detection probability for GateON (averaged over 10 trials for each $\Delta \alpha$). The results indicate a binary behavior: if $\Delta \alpha$ exceeds a certain threshold, tasks are distinctly identified with a probability of 1; otherwise, tasks are amalgamated with equal certainty. This suggests a 100 percent switch-point detection rate if the angle difference is sufficiently large; otherwise, GateON perceives the tasks as identical. C: To confirm that similar tasks can reactivate prior ones, we ran a sequence of three tasks such that angles 1 and 3 differ by $\Delta \alpha$ and angle 2 is always far enough from the other two to be detected as a switch-point. C demonstrates that the angle pairs leading to context collapse in B also trigger reactivation, while those that resulted in distinct tasks do not reactivate.
In summary, our context detector appears proficient at identifying switch-points between significantly dissimilar tasks and can also reactivate or collapse tasks when they are sufficiently alike.
...and 1 more figure
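
The analysis described in the Figure 4 caption, Pearson correlations of context weights across tasks, can be sketched as follows. The context weights here are synthetic: the Gaussian tuning over hypothetical rotation angles, the array shapes, and all variable names are assumptions chosen only to make the computation concrete, and they do not reproduce the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_neurons = 10, 200

# Hypothetical Rotated-style setup: one rotation angle per task (degrees),
# already sorted by increasing angle as in the caption.
angles = np.linspace(0.0, 90.0, n_tasks)
# Assumption: each neuron's context weight is tuned to a preferred angle.
preferred = rng.uniform(0.0, 90.0, size=n_neurons)
tuning_width = 15.0

# Synthetic context weights, shape (n_tasks, n_neurons).
context_weights = np.exp(
    -((angles[:, None] - preferred[None, :]) ** 2) / (2.0 * tuning_width**2)
)

# np.corrcoef treats each row as one variable, so this gives the
# task-by-task Pearson correlation matrix of shape (n_tasks, n_tasks).
corr = np.corrcoef(context_weights)

neighbor = np.mean(np.diag(corr, k=1))  # mean correlation between adjacent tasks
distant = corr[0, -1]                   # correlation between the two most distant tasks
print(f"neighboring tasks: {neighbor:.2f}, most distant tasks: {distant:.2f}")
```

With smooth tuning like this, adjacent tasks end up more strongly correlated than distant ones, which is the kind of banded structure the caption describes for Rotated. Applied per layer to real context weights, the same call would produce one correlation matrix per layer.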
