A study on the plasticity of neural networks
Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, Claudia Clopath
TL;DR
This work investigates a generalisation gap that arises when a model pretrained on the target distribution is fine-tuned, compared with training the same model from random initialisation. Using CIFAR-10 with ResNet-18, it systematically varies optimisers, pretraining depth, data distribution shifts, multi-stage pretraining, and network capacity to map when and how the gap occurs. It proposes a two-phases-of-learning account, suggesting that pretraining reduces gradient noise during fine-tuning, which drives convergence to narrower minima and poorer generalisation, with evidence that increasing the fine-tuning learning rate can mitigate the gap. The results call for careful design of transfer and continual-learning pipelines, including selective reinitialisation of upper layers to preserve plasticity and improve forward transfer.
Abstract
One aim shared by multiple settings, such as continual learning or transfer learning, is to leverage previously acquired knowledge to converge faster on the current task. Usually this is done through fine-tuning, where an implicit assumption is that the network maintains its plasticity, meaning that the performance it can reach on any given task is not negatively affected by previously seen tasks. It has recently been observed that a model pretrained on data from the same distribution as the one it is fine-tuned on might not reach the same generalisation as a freshly initialised one. We build on and extend this observation, providing a hypothesis for the mechanics behind it. We discuss the implications of losing plasticity for continual learning, which relies heavily on optimising pretrained models.
