Transfer Learning in Infinite Width Feature Learning Networks

Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

Abstract

We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, where the downstream predictor is trained on top of source-induced features, and (ii) a jointly rich setting, where both the pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pretraining. In this setup, the summary statistics of randomly initialized networks after rich pretraining are adaptive kernels that depend on both the source data and the source labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels, with features from both source and target tasks. We test our theory on linear and polynomial regression tasks as well as real datasets. Our theory allows interpretable conclusions on performance, which depend on the amount of data on both tasks, the alignment between tasks, and the feature learning strength.

Paper Structure

This paper contains 54 sections, 188 equations, 15 figures.

Figures (15)

  • Figure 1: Fine-tuning from an adaptive kernel from $\mathcal{T}_1$. Dashed black: no pre-training (linear probe). (a) Loss is strictly decreasing with source/target alignment $\alpha_s$ (Result 2). (b) Non-zero alignment with the noise direction ($\alpha_g = 0.1$) can cause negative transfer at high $\nu_2 = P_2/D$ (Result 3). (c) Test loss on $\mathcal{T}_2$ depends only on source data $\nu_1 = P_1/D$ and the alignments $\{\alpha_g, \alpha_s\}$ (Result 4; in this panel, $\alpha_g = 0$).
  • Figure 2: Fine-tuning with adaptive kernels from $\mathcal{T}_1$. Losses vs $\nu_2$ for different values of $\gamma_1$ on $\mathcal{T}_1$. (a) The linear model from Result \ref{th::result2} with $c_1=\nu_1\sqrt{\nu_1 (1-\nu_1)}\chi$, $c_2 = \nu_1^2 \chi$, $c_3 = \nu_1 (1-\nu_1)\chi$, where $\chi = \sqrt{1-\gamma_1^2}-1$, exhibits an optimal $\gamma_1$ at large $\nu_2$. (b)/(c) Two-layer ReLU MLP on CIFAR10: source task is regression on $\{0,1\}$ classes; target task is regression on $\{0,9\}$ classes. Theory is obtained by performing kernel regression on $\mathcal{T}_2$ from the adaptive kernel after $\mathcal{T}_1$.
  • Figure 3: Test losses of a two-layer ReLU MLP vs training steps for different feature learning strengths $\gamma_2$ on $\mathcal{T}_2$. (a) Low-degree polynomial source task $y_1(\bm x) = D^{-1/2}\bm \beta \cdot {\bm x}$ with $P_1 = 1000$, $D=100$ and $\gamma_1 = 1.0$. Target task is $y_2(\bm x) = (D^{-1/2}\bm \beta \cdot \bm x)^2$ with $P_2 = 100$. (b) Source task $\mathrm{He}_{5}(\bm\beta_{1}\cdot \bm x)$ with $P_{1}=1000$ and $\gamma_1 = 1.0$. Target task: $\mathrm{He}_{2}(\bm \beta_{2} \cdot \bm x)$ with $P_{2}=600$ and $\bm \beta_{1}\cdot\bm \beta_{2}=0.8$. Solid lines: gradient descent on an $N=20000$ two-layer ReLU network. Dashed lines: DMFT theory from \ref{th::dmft_tl}.
  • Figure 4: (a)/(b) Transfer learning is beneficial for real tasks at any feature learning strength $\gamma_2$. Source task: classes $1/2$ of CIFAR-10 with $P_1 = 10K$ and $\gamma_1 = 1.0$. Target task: classes $8/9$ of CIFAR-10 with $P_2 = 200$. (c) Preactivation distribution of the target model for different $\gamma_2$. Solid lines: GD at convergence ($N=20000$, two-layer ReLU MLP); black dashed lines: DMFT from \ref{th::dmft_tl}.
  • Figure 5: Test losses as a function of the number of target data points $P_2$ for different feature learning strengths $\gamma_2$ on the downstream task. Source task is a regression on two classes ($0/1$) of CIFAR with $P_1=1000$ data points, labels $\bar{y}\in \{-1,1\}^{P_1}$, and richness $\gamma_1 = 1.0$. Target task is a regression on two classes of CIFAR ($0/9$) with $P_2$ data points and labels $y \in \{-1,1\}^{P_2}$.
  • ...and 10 more figures
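
The Figure 2 caption says the theory curves come from kernel regression on $\mathcal{T}_2$ using the adaptive kernel left by $\mathcal{T}_1$. The NumPy sketch below illustrates only that final regression step, under loud assumptions: the weight matrix `W` is a random stand-in for source-trained first-layer weights (it is not the paper's adaptive kernel), the small ridge term is added for numerical stability, and the function names and the sizes `N` and `P_test` are hypothetical. The target is the quadratic task from Figure 3(a) with $D = 100$ and $P_2 = 100$.

```python
import numpy as np


def relu_feature_kernel(X_a, X_b, W):
    """Kernel of a frozen ReLU feature map: K = phi(X_a) phi(X_b)^T / N."""
    N, D = W.shape
    phi_a = np.maximum(X_a @ W.T / np.sqrt(D), 0.0)
    phi_b = np.maximum(X_b @ W.T / np.sqrt(D), 0.0)
    return phi_a @ phi_b.T / N


def kernel_regression(K_train, y_train, K_test_train, ridge=1e-6):
    """Kernel regression with a small ridge (an assumption, for stability)."""
    P = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(P), y_train)
    return K_test_train @ alpha


rng = np.random.default_rng(0)
D, N, P2, P_test = 100, 2000, 100, 500

# Assumption: random weights stand in for the features learned on the source task.
W = rng.standard_normal((N, D))

# Quadratic target task y_2(x) = (D^{-1/2} beta . x)^2, as in Figure 3(a).
beta = rng.standard_normal(D)
X_train = rng.standard_normal((P2, D))
X_test = rng.standard_normal((P_test, D))
y_train = (X_train @ beta / np.sqrt(D)) ** 2
y_test = (X_test @ beta / np.sqrt(D)) ** 2

# Fit the readout on the target data using the frozen kernel.
K_tt = relu_feature_kernel(X_train, X_train, W)
K_st = relu_feature_kernel(X_test, X_train, W)
y_pred = kernel_regression(K_tt, y_train, K_st)

test_loss = np.mean((y_pred - y_test) ** 2)
print(f"target test loss: {test_loss:.4f}")
```

Replacing `W` with weights obtained after actually training on a source task would turn this into a fine-tuning experiment in the sense of setting (i); the regression step itself would not change.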