A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning
Nicolas Anguita, Francesco Locatello, Andrew M. Saxe, Marco Mondelli, Flavia Mancini, Samuel Lippl, Clementine Domine
TL;DR
The authors study how initialization across layers in a pretraining and fine-tuning (PT+FT) pipeline biases feature reuse and refinement. Analyzing diagonal linear networks trained with gradient flow, and using replica theory, they derive exact generalization-error expressions that depend on the pretraining scale c_PT, the relative scale lambda_PT, and the fine-tuning readout parameter gamma_FT. These expressions reveal four distinct fine-tuning regimes, including a pretraining-dependent rich regime. The theory is validated with controlled diagonal-network experiments and with large-scale ResNet experiments on CIFAR-100, which show how the relative initialization across layers governs when feature reuse or fresh feature learning is advantageous. The results provide principled levers for controlling transfer-learning behavior, with potential relevance to both machine learning practice and neuroscience, and highlight the importance of cross-layer initialization scales in enabling continued feature learning during fine-tuning.
Abstract
Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
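To make the setup concrete, the following is a minimal sketch of a pretraining and fine-tuning pipeline in a two-layer diagonal linear network. It is an illustration under stated assumptions, not the authors' implementation: the parameterization beta = w1 * w2, the names `first_layer_init`, `readout_init_pt`, and `readout_init_ft`, the task sizes, and the choice to re-initialize the readout before fine-tuning are all hypothetical stand-ins and may not match the paper's definitions of c_PT, lambda_PT, and gamma_FT. Small-step gradient descent is used as a discrete proxy for gradient flow.

```python
import numpy as np

def predictor(w1, w2):
    # Effective linear weights of a two-layer diagonal linear network.
    # Assumed form for illustration: beta = w1 * w2 (elementwise).
    return w1 * w2

def mse(X, y, w1, w2):
    return float(np.mean((X @ predictor(w1, w2) - y) ** 2))

def train(X, y, w1, w2, lr=0.01, steps=2000):
    """Small-step gradient descent, a discrete stand-in for gradient flow."""
    w1, w2 = w1.copy(), w2.copy()
    n = X.shape[0]
    for _ in range(steps):
        grad_beta = 2.0 * X.T @ (X @ (w1 * w2) - y) / n
        # Chain rule through beta = w1 * w2; both updates use the old weights.
        w1, w2 = w1 - lr * grad_beta * w2, w2 - lr * grad_beta * w1
    return w1, w2

rng = np.random.default_rng(0)
d, n_pt, n_ft = 50, 200, 20  # hypothetical sizes

# Pretraining task uses 10 features; fine-tuning task uses a subset of 5.
beta_pt = np.zeros(d)
beta_pt[:10] = 1.0
beta_ft = np.zeros(d)
beta_ft[:5] = 1.0

X_pt = rng.standard_normal((n_pt, d))
y_pt = X_pt @ beta_pt
X_ft = rng.standard_normal((n_ft, d))
y_ft = X_ft @ beta_ft

# Hypothetical initialization scales for the two layers during pretraining.
first_layer_init, readout_init_pt = 0.1, 0.1
w1_0 = np.full(d, first_layer_init)
w2_0 = np.full(d, readout_init_pt)

pt_loss_before = mse(X_pt, y_pt, w1_0, w2_0)
w1_pt, w2_pt = train(X_pt, y_pt, w1_0, w2_0)
pt_loss_after = mse(X_pt, y_pt, w1_pt, w2_pt)

# Fine-tune from the pretrained first layer with a freshly initialized readout
# (an assumption of this sketch), comparing two readout scales.
gen_errors = {}
for readout_init_ft in (0.01, 1.0):
    w1_ft, w2_ft = train(X_ft, y_ft, w1_pt, np.full(d, readout_init_ft))
    # For isotropic Gaussian inputs, the population squared error
    # equals the squared distance between learned and true weights.
    err = float(np.sum((predictor(w1_ft, w2_ft) - beta_ft) ** 2))
    gen_errors[readout_init_ft] = err
    print(f"readout_init_ft={readout_init_ft}: generalization error = {err:.4f}")
```

Varying the initialization scales and the overlap between `beta_pt` and `beta_ft` in this sketch gives a rough, qualitative feel for how initialization and task statistics interact; the exact regimes and error expressions are those derived in the paper, not anything this toy script establishes.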
