Transfer Learning with Informative Priors: Simple Baselines Better than Previously Reported

Ethan Harvey, Mikhail Petrov, Michael C. Hughes

TL;DR

This work reevaluates Bayesian transfer learning with informative priors against standard initialization-based transfer across five datasets with small target sets. It shows that standard transfer learning performs far better than previously reported when hyperparameters are carefully tuned, and that gains from informative priors are dataset-dependent, with isotropic priors often competitive with learned low-rank covariances. It also finds high variability in empirical train and test loss landscapes, which undercuts the proposed explanation that informative priors improve alignment between them. To support reproducibility, the authors release code and propose best practices favoring simple, well-tuned baselines over complex priors in most cases. Overall, the findings call for more careful experimental standards in transfer learning research.

Abstract

We pursue transfer learning to improve classifier accuracy on a target task with few labeled examples available for training. Recent work suggests that using a source task to learn a prior distribution over neural net weights, not just an initialization, can boost target task performance. In this study, we carefully compare transfer learning with and without source task informed priors across 5 datasets. We find that standard transfer learning informed by an initialization only performs far better than reported in previous comparisons. The relative gains of methods using informative priors over standard transfer learning vary in magnitude across datasets. For the scenario of 5-300 examples per class, we find negative or negligible gains on 2 datasets, modest gains (between 1.5 and 3 points of accuracy) on 2 other datasets, and substantial gains (>8 points) on one dataset. Among methods using informative priors, we find that an isotropic covariance appears competitive with a learned low-rank covariance matrix while being substantially simpler to understand and tune. Further analysis suggests that the mechanistic justification for informed priors -- hypothesized improved alignment between train and test loss landscapes -- is not consistently supported due to high variability in empirical landscapes. We release code to allow independent reproduction of all experiments.
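
To make the contrast between the two kinds of informative prior concrete, here is a minimal sketch of the prior penalty that a MAP objective would add to the training negative log-likelihood. It is an illustration under stated assumptions, not the authors' released code: the function names, the scale `tau`, and the covariance parameterization `diag(d) + Q Q^T` are hypothetical choices made for this example, and the paper's exact scaling of the low-rank covariance may differ.

```python
import numpy as np

def neg_log_prior_iso(w, mu, tau):
    """Negative log density of N(mu, tau^2 I) at w, dropping terms constant in w."""
    diff = w - mu
    return 0.5 * np.dot(diff, diff) / tau**2

def neg_log_prior_low_rank(w, mu, d, Q):
    """Negative log density of N(mu, diag(d) + Q Q^T) at w, dropping constants.

    Assumed parameterization (hypothetical): d is a positive vector of length D
    and Q is a D x K matrix with K much smaller than D. The Woodbury identity
    means only a K x K linear system is solved.
    """
    diff = w - mu
    K = Q.shape[1]
    Dinv_diff = diff / d                 # D^{-1} (w - mu)
    Dinv_Q = Q / d[:, None]              # D^{-1} Q
    cap = np.eye(K) + Q.T @ Dinv_Q       # K x K matrix I + Q^T D^{-1} Q
    corr = Q.T @ Dinv_diff
    quad = diff @ Dinv_diff - corr @ np.linalg.solve(cap, corr)
    return 0.5 * quad
```

In both cases the MAP objective would be the training NLL plus the returned penalty. The isotropic version has a single scale `tau` to tune, which is one way to read the abstract's remark that it is simpler to understand and tune.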

Paper Structure (28 sections, 6 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Error rate (lower is better) vs. target train set size on CIFAR-10, for various MAP estimation methods for transfer learning from ImageNet. Left: Our results. Right: Results copied from shwartz2022pre (their Tab. 10). Takeaway: In our experiments, standard transfer learning (StdPrior) does better than previously reported. Setting details: The blue and purple lines across both panels come from comparable settings: a common ResNet-50 architecture and common learned values for mean and low-rank (LR) covariance taken directly from the SimCLR pre-trained snapshots in shwartz2022pre's repository. Green line: The left panel's green line is a third-party experiment copied from kaplun2023subtuning, suggesting others can achieve performance similar to ours for standard transfer learning with ResNet-50. They use fully supervised pre-training rather than self-supervised SimCLR. Plotted means and standard deviations were confirmed via direct correspondence with the authors of kaplun2023subtuning.
  • Figure 2: Empirical alignment of loss landscapes for target task across train and test for CIFAR-10 $n=1000$. Compare to https://proceedings.neurips.cc/paper_files/paper/2022/file/b1e7f61f40d68b2177857bfcb195a507-Paper-Conference.pdf#page=2, which is an idealized illustration, not an empirical result. Each panel: We assess a 1D slice of the high-dimensional landscape by linearly interpolating parameters $w$ between a method's optimum, either $w_{Std}^*$ found by minimizing the standard MAP objective (left) or $w_{LR}^*$ found by minimizing the LearnedPriorLR MAP objective (right), and the optimum $w^*$ found at the largest dataset size. Red curve shows the indicated training loss (varies by column) on CIFAR-10 with $n=1000$ samples; blue curve (varies by column) shows CIFAR-10 test set NLL. Each row shows results from a different train set sample of size $n=1000$. The gap between the optimum found via training and the test set's ideal minimum is shown as a double-sided gold arrow. Takeaway: shwartz2022pre's learned prior approach does not always reduce the gap between the trained and ideal minima.
  • Figure D.1: Error rate (lower is better) vs. target train set size on CIFAR-10, for standard transfer learning from ImageNet using fully supervised pre-training. The orange line shows results copied from shwartz2022pre (their Tab. 2). The green line is a third-party experiment copied from kaplun2023subtuning. Plotted means and standard deviations were confirmed via direct correspondence with the authors of kaplun2023subtuning. Takeaway: In third-party experiments, standard transfer learning (StdPrior) performs better at dataset sizes $n \in \{ 1000, 10000, 50000\}$ than reported in shwartz2022pre.
  • Figure E.1: Expanded version of Fig. 2, including a third column for the LearnedPriorIso method. Each panel: We assess a 1D slice of the high-dimensional landscape by linearly interpolating parameters $w$ between a method's optimum, either $w_{Std}^*$ found by minimizing the standard MAP objective (left), $w_{Iso}^*$ found by minimizing the LearnedPriorIso MAP objective (center), or $w_{LR}^*$ found by minimizing the LearnedPriorLR MAP objective (right), and the optimum $w^*$ found at the largest dataset size. Red curve shows the indicated training loss (varies by column) on CIFAR-10 with $n=1000$ samples; blue curve (varies by column) shows CIFAR-10 test set NLL. Each row shows results from a different train set sample of size $n=1000$. The gap between the optimum found via training and the test set's ideal minimum is shown as a double-sided gold arrow.
  • Figure E.2: Alternative version of Fig. 2, looking at a 1D slice that interpolates between the optima found by the two methods, instead of between each method's optimum and the optimum found at the largest dataset size. Each panel: We assess a 1D slice of the high-dimensional landscape by linearly interpolating parameters $w$ between the optimum $w_{Std}^*$ found by minimizing the standard MAP objective and the optimum $w_{LR}^*$ found by minimizing the LearnedPriorLR MAP objective. Red curve shows the indicated training loss (varies by column) on CIFAR-10 with $n=1000$ samples; blue curve (same across columns) shows CIFAR-10 test set NLL. Each row shows results from a different train set sample of size $n=1000$. The gap between the optimum found via training and the test set's minimum on the 1D slice is shown as a double-sided gold arrow.
  • ...and 2 more figures
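
The 1D slices described in the captions for Figures 2, E.1, and E.2 can be sketched roughly as follows. This is a hedged illustration rather than the authors' procedure: the helper names and the toy quadratic losses are hypothetical, parameters are treated as flat vectors, and `minimizer_gap` measures the gap in units of the interpolation coefficient along the slice, which is only one possible reading of the gold arrow.

```python
import numpy as np

def interpolate_params(w_a, w_b, alphas):
    """Return the points (1 - alpha) * w_a + alpha * w_b, one row per alpha."""
    alphas = np.asarray(alphas, dtype=float)
    return (1.0 - alphas)[:, None] * w_a[None, :] + alphas[:, None] * w_b[None, :]

def loss_along_slice(loss_fn, w_a, w_b, alphas):
    """Evaluate loss_fn at each interpolated parameter vector on the slice."""
    return np.array([loss_fn(w) for w in interpolate_params(w_a, w_b, alphas)])

def minimizer_gap(alphas, train_losses, test_losses):
    """Distance in alpha between the train minimizer and test minimizer on the slice."""
    alphas = np.asarray(alphas, dtype=float)
    return abs(alphas[np.argmin(train_losses)] - alphas[np.argmin(test_losses)])

# Toy usage with quadratic stand-ins for the training and test losses.
w_trained = np.array([0.0, 0.0])    # stand-in for an optimum found on n=1000 samples
w_reference = np.array([2.0, 0.0])  # stand-in for the optimum at the largest dataset size
alphas = np.linspace(0.0, 1.0, 11)
train_curve = loss_along_slice(lambda w: np.sum((w - w_trained) ** 2),
                               w_trained, w_reference, alphas)
test_curve = loss_along_slice(lambda w: np.sum((w - np.array([1.0, 0.0])) ** 2),
                              w_trained, w_reference, alphas)
gap = minimizer_gap(alphas, train_curve, test_curve)
```

With a real network, `loss_fn` would load each interpolated vector into the model and compute either the training objective or the test set NLL. Repeating the computation over several train set samples, as each row of the figures does, is what exposes the variability discussed in the abstract.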