Keep Moving: identifying task-relevant subspaces to maximise plasticity for newly learned tasks

Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann

TL;DR

This work addresses the stability-plasticity dilemma in continual learning by decomposing activation changes into two orthogonal subspaces: the readout range, in which change can affect past task performance, and the nullspace, which is invisible to past readouts. The authors develop a readout-based decomposition and a gradient-based functional-subspace estimation to diagnose and manipulate learning in linear and nonlinear networks. They show that regularisation methods over-constrain both subspaces, reducing plasticity, while gradient-projection approaches and replay-based methods can maintain high stability with greater learning flexibility. In nonlinear networks, they introduce a per-layer approximation using old-task gradients to estimate the functional range and nullspace, demonstrating that restricting learning to the functional nullspace yields strong stability with some plasticity trade-offs but can outperform traditional regularisers like EWC. Overall, the work provides a practical diagnostic framework and guiding principles for designing continual-learning algorithms that preserve prior knowledge while maximising learning capacity for new tasks.

Abstract

Continual learning algorithms strive to acquire new knowledge while preserving prior information. Often, these algorithms emphasise stability and restrict network updates upon learning new tasks. In many cases, such restrictions come at a cost to the model's plasticity, i.e. the model's ability to adapt to the requirements of a new task. But is all change detrimental? Here, we approach this question by proposing that activation spaces in neural networks can be decomposed into two subspaces: a readout range in which change affects prior tasks and a null space in which change does not alter prior performance. Based on experiments with this novel technique, we show that, indeed, not all activation change is associated with forgetting. Instead, only change in the subspace visible to the readout of a task can lead to decreased stability, while restricting change outside of this subspace is associated only with a loss of plasticity. Analysing various commonly used algorithms, we show that regularisation-based techniques do not fully disentangle the two spaces and, as a result, restrict plasticity more than need be. We expand our results by investigating a linear model in which we can manipulate learning in the two subspaces directly and thus causally link activation changes to stability and plasticity. For hierarchical, nonlinear cases, we present an approximation that enables us to estimate functionally relevant subspaces at every layer of a deep nonlinear network, corroborating our previous insights. Together, this work provides novel means to derive insights into the mechanisms behind stability and plasticity in continual learning and may serve as a diagnostic tool to guide developments of future continual learning algorithms that stabilise inference while allowing maximal space for learning.
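The decomposition described in the abstract can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names `readout_bases` and `decompose_change`, the SVD-based construction of the bases, and the random toy data are assumptions made for illustration. The sketch assumes a linear readout `W` with `logits = W @ h`, takes the readout range to be the row space of `W`, and writes the two projectors as $CC^{\top}$ and $NN^{\top}$, following the notation in the figure captions.

```python
import numpy as np

def readout_bases(W, tol=1e-10):
    """Return orthonormal bases (C, N) for the range and nullspace of a readout.

    W has shape (n_outputs, n_hidden), so that logits = W @ h.
    C spans the row space of W (directions the readout can see);
    N spans the nullspace of W (directions the readout cannot see).
    """
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    C = vt[:rank].T   # shape (n_hidden, rank)
    N = vt[rank:].T   # shape (n_hidden, n_hidden - rank)
    return C, N

def decompose_change(h_before, h_after, C, N):
    """Split activation change (n_samples, n_hidden) into range and nullspace parts."""
    delta = h_after - h_before
    delta_range = delta @ C @ C.T   # component visible to the old readout
    delta_null = delta @ N @ N.T    # component invisible to the old readout
    return delta_range, delta_null

# Toy data (hypothetical): 2 output units, 8 hidden units, 5 samples.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))
h_before = rng.normal(size=(5, 8))
h_after = h_before + rng.normal(size=(5, 8))

C, N = readout_bases(W)
d_range, d_null = decompose_change(h_before, h_after, C, N)

# Moving activations only within the nullspace leaves the old logits unchanged.
logits_before = h_before @ W.T
logits_null_only = (h_before + d_null) @ W.T
```

In this toy setting the two components sum to the full activation change, and only `d_range` alters the old task's logits, which is the sense in which change in the nullspace cannot cause forgetting.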

Paper Structure (24 sections, 4 equations, 7 figures)

Figures (7)

  • Figure 1: Linking stability and plasticity to changes in activations seen by prior readouts. (A) Learning for a new task can cause two kinds of activation change from the perspective of the old task's readout. Changes perpendicular to the decision boundary for the old task can affect stability (left). Changes parallel to the decision boundary are invisible to the readout and cannot cause forgetting (right). The conceptual plots show the change in hypothetical activation patterns. (B) The readout range and nullspace define a new basis in which activation change can be meaningfully linked to stability and plasticity, respectively. (C) Activation change in the range of the old readout $\left(\mathbf{CC^{\top}}\right)$ affects stability while restricting activation space in any subspace may be detrimental to plasticity. Taking into account both constraints, a successful continual learner is expected to restrict only learning in the range of old tasks.
  • Figure 2: Stability and plasticity trade-off curves comparing selected continual learning algorithms. (A) Stability and plasticity of selected regularisation and replay methods for continual learning. Hue indicates the algorithm used and shading indicates the strength of regularisation used (for a full list of parameters for each algorithm, see \ref{sec:appendix_nonlinear}). Dotted lines indicate three levels of stability for which we compare activation change in panel B. With increasing regularisation strength, the replay-based algorithms (data replay and GEM) retain higher plasticity while maintaining high stability, compared to the regularisation algorithms (EWC, SI, and LwF). (B) Activation change at the pre-readout layer for data from the first task as a result of learning $10$ additional tasks. Activation change is decomposed into the range and null space of the readout for task 1. The three panels show the activation change for the tested algorithms, approximately matched for stability (at the stability levels indicated in panel A with dotted lines). At a given stability level, a higher degree of activation change in null space corresponds to more plasticity. Analyses of the stability-plasticity trade-offs of these algorithms and the corresponding displacements in range and null space of the task 1 readout are shown in Appendix Figure \ref{fig:comparison_over_phases}.
  • Figure 3: Plasticity and stability achieved by a one hidden-layer linear neural network trained on the Split MNIST task with gradient decomposition in the range and nullspace of the old task readout. (A) Stability and plasticity of networks trained with gradient-based subspace decomposition, under different configurations of $\alpha$ and $\beta$, and with EWC. Each data point shows the performance of a network on the first task (stability) and the second task (plasticity), after training on both tasks. Data points are coloured according to the algorithm and parameters used. Green hues indicate networks trained with EWC, with darker shades indicating stronger regularisation. Red and blue hues indicate networks trained with readout weight-based activation decomposition into the old readout's range and nullspace. Red hues show networks where only learning in the range is restricted ($\alpha$ is varied, $\beta=1$). Blue hues indicate networks where learning in the range is restricted completely ($\alpha=0$) and restrictions on the functional nullspace are varied ($\beta$). (B) Activation change for data of the first task as a result of learning the second task. Data points show movement corresponding to stability and plasticity results in (A). The colour of the overlaid contour indicates Stability + Plasticity (as in Fig. \ref{fig:conceptual}C). A darker colour indicates high stability and plasticity. For extended results see Fig. \ref{fig:extended_contour}.
  • Figure 4: Stability and plasticity in a nonlinear network trained on Split CIFAR-10 using gradient-based activation decomposition and EWC. (A) Stability and plasticity of networks trained with gradient-based subspace decomposition and with EWC. Each data point shows the performance of a network on the first task (stability) and the second task (plasticity), after training on both tasks. Data points are coloured according to the algorithm and parameters used. Green hues indicate networks trained with EWC, with darker shades indicating stronger regularisation. Red and blue hues indicate networks trained with gradient-based activation decomposition. Red hues show networks where only the functional range is restricted ($\alpha$ is varied, $\beta=1$). Blue hues indicate networks where learning in the range is restricted completely ($\alpha=0$) and restrictions on the functional nullspace are varied ($\beta$). (B) Activation change at the pre-readout layer for data from the first task as a result of learning the second task. Activation change is decomposed into the change in the range of the first task's readout (Displacement in range, $CC^{\top}$) and change in its nullspace (Displacement in nullspace, $NN^{\top}$). In both panels, points of interest are labelled: the baseline condition, where learning is unrestricted ($\alpha=1, \beta=1$), the condition where learning is restricted completely to the functional nullspace ($\alpha=0, \beta=1$), and the condition where the model's hidden layers are fully frozen ($\alpha=0, \beta=0$).
  • Figure 5: Movement in range and nullspace for gradient decomposition as discussed in Section \ref{sec:linear}. The three panels show activation change in range and null space. In each panel, the surface is coloured according to a different performance measure, analogously to Figure \ref{fig:conceptual}.
  • ...and 2 more figures
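
The $\alpha$ and $\beta$ settings named in the captions of Figures 3 and 4 can be read as scaling factors on the range and nullspace components of an update. The sketch below illustrates that reading for a one hidden-layer linear network; it is an assumption-laden illustration, not the authors' code. In particular, the function name `scaled_update`, the choice to apply the projector $\alpha CC^{\top} + \beta NN^{\top}$ to the hidden-layer weight gradient, and the random stand-in gradient are all hypothetical.

```python
import numpy as np

def scaled_update(grad_W1, W_readout, alpha, beta, tol=1e-10):
    """Scale the range and nullspace components of a hidden-layer gradient.

    grad_W1:   gradient for hidden weights W1, shape (n_hidden, n_in), h = W1 @ x.
    W_readout: old-task readout, shape (n_outputs, n_hidden).
    alpha:     scale on the component in the old readout's range.
    beta:      scale on the component in the old readout's nullspace.
    """
    _, s, vt = np.linalg.svd(W_readout, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    C, N = vt[:rank].T, vt[rank:].T
    P = alpha * (C @ C.T) + beta * (N @ N.T)
    return P @ grad_W1

# Toy setup (hypothetical sizes and data).
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))       # hidden weights
W_old = rng.normal(size=(2, 8))    # frozen readout of the old task
grad = rng.normal(size=W1.shape)   # stand-in for a new-task gradient
x = rng.normal(size=(4, 3))        # old-task inputs, one column per sample

# alpha=0, beta=1: learning restricted completely to the nullspace.
W1_new = W1 - 0.1 * scaled_update(grad, W_old, alpha=0.0, beta=1.0)
old_before = W_old @ W1 @ x
old_after = W_old @ W1_new @ x
```

Under these assumptions, $\alpha=1, \beta=1$ returns the gradient unchanged (the unrestricted baseline), $\alpha=0, \beta=0$ returns a zero update (frozen hidden layer), and $\alpha=0, \beta=1$ changes the hidden weights while leaving the old readout's outputs on old-task inputs untouched.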