A Theory of Initialisation's Impact on Specialisation

Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli

TL;DR

The paper investigates how initialisation biases influence representation specialisation and forgetting in continual learning. Using deep linear dynamics and mean-field analyses, it shows that weight imbalance and readout entropy favour specialised representations, producing distinct forgetting profiles (Maslow's hammer vs monotonic forgetting) and modulating Elastic Weight Consolidation performance. Empirical support spans disentangled learning with $\beta$-VAE and two-task teacher-student continual-learning setups, providing a principled initialisation-based lever to control forgetting and inform regularisation strategies. The work highlights fundamental links between initial conditions, representation structure, and transfer in lifelong learning, with practical implications for designing more robust continual-learning systems and disentanglement-focused models.

Abstract

Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network's tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks. Finally, we show that specialisation by weight imbalance is beneficial for the commonly employed elastic weight consolidation regularisation technique.
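As an illustration of what "weight imbalance" can mean, here is a minimal sketch for a single linear pathway with scalar weights `a` (first layer) and `b` (second layer). It assumes the convention common in the deep-linear-network literature, where the imbalance is `b**2 - a**2`; the paper's exact definition and sign convention are not given in this summary. The helpers `train_pathway` and `imbalance`, and all numerical values, are hypothetical. Under gradient flow this quantity is conserved, and small-step gradient descent keeps it approximately constant.

```python
def train_pathway(a, b, target=1.0, lr=1e-3, steps=20_000):
    """Gradient descent on 0.5 * (target - a * b) ** 2 for scalar weights a, b."""
    for _ in range(steps):
        err = target - a * b
        # Simultaneous update of both layers (negative gradient of the loss).
        a, b = a + lr * b * err, b + lr * a * err
    return a, b


def imbalance(a, b):
    """Assumed convention: second-layer weight squared minus first-layer weight squared."""
    return b ** 2 - a ** 2


# Hypothetical imbalanced initialisation: small first layer, large second layer.
a0, b0 = 0.1, 1.0
a1, b1 = train_pathway(a0, b0)

print("imbalance before:", imbalance(a0, b0))
print("imbalance after: ", imbalance(a1, b1))
print("learned product: ", a1 * b1)
```

The pathway fits the target while its imbalance stays close to the initial value, which is why the imbalance acts as a property fixed by the initialisation rather than by training.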

Paper Structure

This paper contains 19 sections, 46 equations, 13 figures, 1 algorithm.

Figures (13)

  • Figure 1: Initialisation impacts specialisation. a) In the teacher-student setup a student network is trained with labels generated by a fixed teacher network. Previous work established a relationship between the activation function $\phi$ and the propensity for the student nodes to specialise to teacher nodes. However, we show in this work that this is an overly simplistic description; other factors, including the student weight initialisations $I_W, I_h$, parameterised by $\Theta_W, \Theta_h$, arguably play a stronger role. b) Generalisation error curves for two simulations of the teacher-student setup, one with a ReLU activation function and one with a scaled error activation function. $\Theta_W$ and $\Theta_h$ are chosen to achieve a solution with ReLU that specialises, as indicated by sparser overlap matrices on the bottom right, and a scaled error function solution that does not specialise, as indicated by denser overlap matrices on the top right. A sparse (dense) $Q$ matrix shows that few (many) student nodes are active, while a sparse (dense) $R$ matrix shows that student nodes represent teacher nodes in a targeted (redundant) manner. Further details for the quantities described can be found in \ref{sec:cl}.
  • Figure 2: Summary of our setup, notation and strategy. a) The original network with two hidden neurons learning the regression task. b) We split the network into two separate pathways and consider their dynamics individually. Since both pathways are learning the same task simultaneously, their dynamics are coupled. c) To obtain the dynamics of the two pathways and calculate their escaping and hitting times, we track the pathway dynamics in terms of the network's effective singular values. The closed-form dynamics for the pathway singular value are given in Eq. \ref{eqn:linear_dynam}.
  • Figure 3: Linear dynamics from imbalanced initialisation lead to specialisation. Panels a-b) show agreement between our theoretical curves and simulations for the training dynamics of: (a) the network's singular value dynamics, escaping times (verticals towards the left) and hitting times (verticals towards the right) for varying scales of weight imbalance $\lambda$ (depicted by colour), and (b) the network's movement in weight space, depicted by the sequence of dots over weight space. Colour depicts the loss of the network configuration at a point. Panel c) shows a phase diagram representing how pathways with different initial weight imbalances lead to specialisation. The two axes represent the weight imbalance of the two pathways in our broader network ($\lambda_2$ on the x-axis for the slower pathway and $\lambda_1$ on the y-axis for the faster pathway). The colour represents how close the slower pathway is to reaching its escaping time at its closest point throughout training (in $\log$ scale). We see that the more imbalanced the fast pathway is relative to the slower pathway, the more likely the network is to specialise. The white region represents when the imbalance is equal or reversed.
  • Figure 4: Violin plots of a) the Disentanglement, Completeness, and Informativeness (DCI) score (Eastwood and Williams, 2018) and b) the reconstruction loss against gain. The disentanglement score decreases as the gain increases while the reconstruction loss remains steady. c) Example traversals of models with gains $2$ and $0.3$, respectively, highlighting a disentangled dimension for gain $0.3$ and a mixed dimension for gain $2$. Experimental details can be found in appendix \ref{app:disentenglement}.
  • Figure 5: Phase diagrams show significance of initialisation for specialisation. The phase diagrams show with colour the aggregated entropy (Eq. \ref{eq:entropies}) evaluated for different initialisations. On the x-axis we span over the standard deviation of the first layer. The second layer is initialised using polar coordinates, and the y-axis represents the norm while the different panels give the angle spanning from orthogonal units ($\theta=0$) to identical units ($\theta=\pi/4$). Specialisation is achieved by blue-leaning initialisations, while yellow-leaning ones exhibit high entropy and therefore non-specialised solutions. Additional results can be found in \ref{app:further_phase}.
  • ...and 8 more figures
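
To make the overlap matrices in the Figure 1 caption concrete, the following is a minimal sketch that assumes the standard teacher-student convention: $Q = W W^\top / N$ for student-student overlaps and $R = W W^{*\top} / N$ for student-teacher overlaps, with $N$ the input dimension. The paper's exact normalisation is not stated in this summary, so treat these definitions, the helper `overlap_matrices`, and the two hand-built students as illustrative assumptions rather than the authors' setup.

```python
import numpy as np


def overlap_matrices(student_w, teacher_w):
    """Return (Q, R) for weight matrices of shape (hidden_units, input_dim).

    Assumed convention: Q = W W^T / N (student-student),
    R = W W*^T / N (student-teacher), with N the input dimension.
    """
    n = student_w.shape[1]
    q = student_w @ student_w.T / n
    r = student_w @ teacher_w.T / n
    return q, r


rng = np.random.default_rng(0)
input_dim = 1000
teacher_w = rng.standard_normal((2, input_dim))

# Hypothetical specialised student: each student node copies one teacher node.
specialised_w = teacher_w.copy()
# Hypothetical redundant student: both nodes hold the average of the teacher nodes.
redundant_w = np.tile(teacher_w.mean(axis=0), (2, 1))

q_spec, r_spec = overlap_matrices(specialised_w, teacher_w)
q_red, r_red = overlap_matrices(redundant_w, teacher_w)

print("R (specialised):\n", np.round(r_spec, 2))
print("R (redundant):\n", np.round(r_red, 2))
```

In the specialised case the mass of $R$ sits on the diagonal (a sparse, targeted overlap), whereas in the redundant case every student node overlaps comparably with every teacher node (a dense overlap).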