A study on the plasticity of neural networks

Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, Claudia Clopath

TL;DR

This work investigates a generalisation gap that arises when fine-tuning a model pretrained on the target distribution, versus training from random initialisation. Using CIFAR-10 with ResNet-18, it systematically varies optimisers, pretraining depth, data distribution shifts, multi-stage pretraining, and network capacity to map when and how the gap occurs. It proposes a Two-Phases of Learning account, suggesting pretraining reduces gradient noise during tuning, driving convergence to narrower minima and poorer generalisation, with evidence that increasing the tuning learning rate can mitigate the gap. The results imply careful design of transfer and continual-learning pipelines, including selective reinitialisation of upper layers to preserve plasticity and improve forward transfer.

Abstract

One aim shared by multiple settings, such as continual learning or transfer learning, is to leverage previously acquired knowledge to converge faster on the current task. Usually this is done through fine-tuning, where an implicit assumption is that the network maintains its plasticity, meaning that the performance it can reach on any given task is not negatively affected by previously seen tasks. It has recently been observed that a model pretrained on data from the same distribution as the one it is fine-tuned on might not reach the same generalisation as a freshly initialised one. We build on and extend this observation, providing a hypothesis for the mechanics behind it. We discuss the implications of losing plasticity for continual learning, which relies heavily on optimising pretrained models.

Paper Structure

This paper contains 17 sections, 11 figures.

Figures (11)

  • Figure 1: Our reproduction of the core experiment performed by ash2019difficulty. A ResNet-18 model is pretrained on half of the CIFAR-10 training data, and then tuned on the full training set. It generalises worse than the model trained from scratch.
  • Figure 2: Average test accuracy in the last 100 epochs of tuning after pretraining the model for different numbers of epochs using Adam with a constant learning rate ($10^{-3}$) for both phases.
  • Figure 3: Models trained in a single stage where each example is individually sampled with probability $p=1-\gamma^{50 n/N}$ from the full training data, and with probability $1-p$ from the pretrain set ($n$ is the current step, while $N$ is the total number of steps, the equivalent of 500 epochs). See Section \ref{sec:blending} for more details.
  • Figure 4: The same model (ResNet-18) was trained in multiple stages. Each pretraining stage except the last consisted of a number of steps proportional to the number of examples and sufficient to reach 100% accuracy on the training set. Right: In the new data for a given stage, a given ratio of examples is drawn uniformly from the training set, and the rest from the classes designated for that stage (see Section \ref{sec:splits} for details).
  • Figure 5: Average performance on the test set for residual networks of various depths and widths. See Section \ref{sec:resnets} for details on the models' architectures.
  • ...and 6 more figures
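
The sampling rule in the Figure 3 caption can be illustrated with a minimal sketch. It assumes only the formula $p=1-\gamma^{50 n/N}$ from the caption; the function names, the toy integer "datasets", and the value of $\gamma$ used below are hypothetical and not taken from the paper, which does not state $\gamma$ in this excerpt.

```python
import random


def full_data_probability(step: int, total_steps: int, gamma: float) -> float:
    """Probability of drawing from the full training data at step n.

    Implements p = 1 - gamma ** (50 * n / N) from the Figure 3 caption.
    `gamma` is left as a free parameter (its value is an assumption here).
    """
    return 1.0 - gamma ** (50.0 * step / total_steps)


def sample_example(step, total_steps, gamma, full_set, pretrain_set, rng):
    """Draw one example: from `full_set` with probability p, else from `pretrain_set`."""
    p = full_data_probability(step, total_steps, gamma)
    source = full_set if rng.random() < p else pretrain_set
    return rng.choice(source)


if __name__ == "__main__":
    # Toy stand-ins for datasets (hypothetical): indices instead of images.
    full_set = list(range(10))
    pretrain_set = full_set[:5]  # assumed here to be a subset of the full data
    rng = random.Random(0)

    total_steps = 1000
    gamma = 0.9  # hypothetical value, chosen only for illustration
    for step in (0, 10, 100, 1000):
        p = full_data_probability(step, total_steps, gamma)
        example = sample_example(step, total_steps, gamma, full_set, pretrain_set, rng)
        print(f"step={step:4d}  p={p:.4f}  sampled={example}")
```

With any $0 < \gamma < 1$, $p$ starts at 0 (only pretrain-set examples) and increases monotonically towards 1, so the two data sources are blended smoothly within a single training stage rather than switched abruptly.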