A Theory of Initialisation's Impact on Specialisation

Devon Jarvis; Sebastian Lee; Clémentine Carla Juliette Dominé; Andrew M Saxe; Stefano Sarao Mannelli

TL;DR

The paper investigates how initialisation biases influence representation specialisation and forgetting in continual learning. Using deep linear network dynamics and mean-field analyses, it shows that weight imbalance and readout entropy favour specialised representations, which in turn produce distinct forgetting profiles (Maslow's hammer versus monotonic forgetting) and modulate the performance of Elastic Weight Consolidation. Empirical support comes from disentangled representation learning with a $\beta$-VAE and from two-task teacher-student continual-learning setups. Together these results offer a principled, initialisation-based lever for controlling forgetting and for informing regularisation strategies. The work links initial conditions, representation structure, and transfer in lifelong learning, with practical implications for designing more robust continual-learning systems and disentanglement-focused models.

Abstract

Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network's tendency to reuse learned features across tasks. However, this explanation relies heavily on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task similarity and forgetting in non-specialised networks. Finally, we show that specialisation by weight imbalance benefits the commonly employed elastic weight consolidation regularisation technique.
