A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning

Nicolas Anguita; Francesco Locatello; Andrew M. Saxe; Marco Mondelli; Flavia Mancini; Samuel Lippl; Clementine Domine

TL;DR

The authors study how initialization scale across layers in a pretraining and fine-tuning (PT+FT) pipeline biases feature reuse and refinement. Analyzing diagonal linear networks trained by gradient flow with replica theory, they derive exact generalization-error expressions in terms of the pretraining scale $c_{PT}$, the relative scale across layers $\lambda_{PT}$, and the fine-tuning readout scale $\gamma_{FT}$, and identify four distinct fine-tuning regimes, including a pretraining-dependent rich regime. They validate the theory with controlled diagonal-network simulations and with large-scale ResNet experiments on CIFAR-100, showing that the relative initialization across layers governs when reusing pretrained features or learning new ones is advantageous. The results offer principled levers for controlling transfer-learning behavior, with potential relevance to both machine learning practice and neuroscience.

Abstract

Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
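The pipeline described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code: it assumes a two-layer diagonal parametrization $\beta = w \odot v$ (first-layer weights $w$, readout $v$), approximates gradient flow by gradient descent with a small step size, initializes both layers at the same scale (standing in for $c_{PT}$, so no relative scale $\lambda_{PT}$ is modeled), and resets the readout to a constant (standing in for $\gamma_{FT}$) before fine-tuning. The dimensions, sample sizes, and teacher vectors are hypothetical choices made only for illustration.

```python
import numpy as np

def train_diag(X, y, w, v, lr=0.02, steps=4000):
    """Gradient descent (small-step proxy for gradient flow) on
    0.5 * mean((X @ (w * v) - y) ** 2) for beta = w * v (assumed form)."""
    w, v = w.copy(), v.copy()
    n = X.shape[0]
    for _ in range(steps):
        g = X.T @ (X @ (w * v) - y) / n          # gradient w.r.t. beta
        w, v = w - lr * v * g, v - lr * w * g    # chain rule, simultaneous update
    return w, v

def half_mse(X, y, w, v):
    return 0.5 * np.mean((X @ (w * v) - y) ** 2)

rng = np.random.default_rng(0)
D, n_pt, n_ft = 40, 200, 20                      # hypothetical sizes

# Hypothetical teachers: the fine-tuning task uses a subset of the
# pretraining dimensions (the "subset" setting mentioned in the abstract).
beta_pt = np.zeros(D); beta_pt[:8] = 1.0
beta_ft = np.zeros(D); beta_ft[:4] = 1.0

X_pt = rng.standard_normal((n_pt, D)); y_pt = X_pt @ beta_pt
X_ft = rng.standard_normal((n_ft, D)); y_ft = X_ft @ beta_ft

c_pt, gamma_ft = 1e-2, 0.0                       # assumed stand-ins for c_PT, gamma_FT

# Pretraining from a small, balanced initialization.
w_pt, _ = train_diag(X_pt, y_pt, np.full(D, c_pt), np.full(D, c_pt))

# Fine-tuning: keep the pretrained first layer, reset the readout.
v_init = np.full(D, gamma_ft)
loss_before = half_mse(X_ft, y_ft, w_pt, v_init)
w_ft, v_ft = train_diag(X_ft, y_ft, w_pt, v_init)
loss_after = half_mse(X_ft, y_ft, w_ft, v_ft)

# Squared distance between the learned and the true fine-tuning coefficients.
gen_err = np.sum((w_ft * v_ft - beta_ft) ** 2)
print(loss_before, loss_after, gen_err)
```

Changing `c_pt`, `gamma_ft`, or the overlap between the two teachers gives a rough feel for how initialization and task structure interact; the sketch is not meant to reproduce the paper's quantitative predictions.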

Paper Structure (70 sections, 5 theorems, 105 equations, 9 figures)

Introduction
Related Work
Theoretical Setup
Network Architecture
Network Training
Data Generative Model
Theoretical Characterization of Generalization Error in Fine-Tuning
Implicit Inductive Bias of Fine-Tuning
Replica Theory of the Generalization Error
Understanding Learning Regimes in PT+FT
Learning Regimes in the Limit
Full Phase Portrait of the Learning Regimes
Learning Regime and Task Parameters Jointly Determine Generalization Error Curves
Large-Scale Vision Experiments
Conclusion
...and 55 more sections

Key Result

Theorem 4.1

Consider a diagonal linear network trained under the paradigm described in the Network Training section. Then, the gradient flow solution at convergence is given by

Figures (9)

Figure 1: Setup. Schematic illustration of the theoretical setup: dependence on the initialization parameters $c_{PT}$, $\lambda_{PT}$, and $\gamma_{FT}$ (shown in pink); and dependence on the data parameters $\rho_{PT}$, $\rho_{FT}^{\text{shared}}$, and $\rho_{FT}^{\text{new}}$ (shown in grey).
Figure 2: Implicit bias and learning regimes of fine-tuning. (a) $\ell$-order and pretraining dependence jointly define four different learning regimes in PT+FT. Different initialization parameters induce changes between these learning regimes, as indicated by the arrows. (b) We can interpolate between these four regimes (for a color legend, see panel (a)) by shifting the initialization parameters ($\hat{\beta}_{FT,d}=1/\sqrt{\rho_{FT}}$, $\rho_{FT}=0.1$). Crucial transitions, which we further highlight in the text and in panel (a), are indicated by the arrows. Simulation details can be found in the implementation appendix.
Figure 3: Generalization curves for different initialization parameters and task parameters. The generalization error $\mathcal{E}$ as a function of the data scale $\alpha_{FT}$. Lines depict replica predictions and points depict the results of our empirical simulations ($\pm2$ standard errors). In all cases, $\rho_{PT}=0.1$. We consider $c_{PT}=10^{-3}$, $\lambda_{PT}=0$, and $\gamma_{FT}=0$, varying one initialization parameter for each panel. Simulation details can be found in the implementation appendix. (a-c) We consider either overlapping ($\rho_{FT}^{\text{shared}}=0.1,\rho_{FT}^{\text{new}}=0$) or distinct ($\rho_{FT}^{\text{shared}}=0,\rho_{FT}^{\text{new}}=0.1$) FT dimensions. (d,e) We consider $\rho_{FT}^{\text{shared}}=0$ and vary $\rho_{FT}^{\text{new}}$. (f) We consider $\rho_{FT}^{\text{new}}=0$ and vary $\rho_{FT}^{\text{shared}}$.
Figure 4: ResNet CIFAR-100. Generalization performance as a function of the number of samples and initialization parameters. We vary (a) the absolute scale of initialization by multiplying all weights in the network by $c_{PT}$, (b) the relative scale of initialization by multiplying the first three blocks of the ResNet by $\kappa$ (equivalent to $\lambda_{PT}$), and (c) the readout initialization by multiplying the readout by $\gamma_{FT}$. Simulation details can be found in the implementation appendix.
Figure 5: Comparison of the four fine-tuning learning regimes. We show three task parameter settings: 1) no overlap between pretraining and fine-tuning dimensions; 2) identical pretraining and fine-tuning dimensions; 3) fine-tuning dimensions as a subset of pretraining dimensions. For different initialization parameter settings, we show: a lazy, pretraining-dependent regime (II, shown in purple), a lazy, pretraining-independent regime (III, shown in grey), an intermediate rich, pretraining-dependent regime (IV, shown in green), and a richer, less pretraining-dependent regime (I, shown in yellow). For the task parameter setting without any overlap, the regime approaching the rich, pretraining-independent regime (I, in yellow) is optimal. For complete overlap between pretraining and fine-tuning dimensions, the lazy, pretraining-dependent regime (II, in purple) is optimal. Finally, for the task where the fine-tuning dimensions are a subset of the pretraining dimensions, the rich, pretraining-dependent regime (IV, in green) is optimal. All of these observations are predicted by our theoretical insight into the inductive bias of PT+FT.
...and 4 more figures
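
The three interventions listed in the Figure 4 caption amount to rescaling groups of weights. The sketch below illustrates that idea on a plain dictionary of NumPy arrays; it is an assumption-laden illustration, not the authors' implementation. The parameter names (`block1`, ..., `fc`), the helper names, and the choice to apply $c_{PT}$ and $\kappa$ at pretraining initialization and $\gamma_{FT}$ to the readout before fine-tuning are all hypothetical.

```python
import numpy as np

def scale_pretraining_init(params, c_pt=1.0, kappa=1.0,
                           early_prefixes=("block1", "block2", "block3")):
    """Multiply every weight by c_pt, and the early blocks additionally by kappa.
    Parameter names and prefixes are hypothetical."""
    scaled = {}
    for name, weight in params.items():
        factor = c_pt
        if name.startswith(early_prefixes):   # str.startswith accepts a tuple
            factor *= kappa
        scaled[name] = factor * weight
    return scaled

def scale_readout(params, gamma_ft=1.0, readout_prefix="fc"):
    """Multiply only the readout weights by gamma_ft (assumed to be applied
    before fine-tuning); all other weights are copied unchanged."""
    return {
        name: (gamma_ft * weight if name.startswith(readout_prefix) else weight.copy())
        for name, weight in params.items()
    }

# Toy "network" with hypothetical parameter names.
params = {
    "block1.weight": np.ones((2, 2)),
    "block4.weight": np.ones((2, 2)),
    "fc.weight": np.ones((3, 2)),
}

pt_init = scale_pretraining_init(params, c_pt=0.5, kappa=2.0)
ft_init = scale_readout(pt_init, gamma_ft=0.1)
```

Keeping the two helpers separate mirrors the distinction in the caption between parameters set at pretraining ($c_{PT}$, $\kappa$) and the one set for fine-tuning ($\gamma_{FT}$).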

Theorems & Definitions (11)

Theorem 4.1: Implicit bias
proof
Proposition 4.2
proof
Proposition 5.1
proof
Theorem 2.1: Implicit bias
proof
Proposition 2.1
proof
...and 1 more
