Transfer Learning with Informative Priors: Simple Baselines Better than Previously Reported
Ethan Harvey, Mikhail Petrov, Michael C. Hughes
TL;DR
This work reevaluates Bayesian transfer learning with informative priors against standard initialization-based transfer across five datasets with small target training sets. It shows that standard transfer learning, when its hyperparameters are carefully tuned, performs far better than previously reported, and that the gains from informative priors are dataset-dependent, with isotropic priors often competitive with learned low-rank covariances. It also finds high variability in empirical loss landscapes, which challenges the proposed explanation that informative priors improve alignment between train and test loss landscapes. To support reproducibility, the authors release code and propose best practices favoring simple, well-tuned baselines over complex priors in most cases. Overall, the findings call for more careful experimental standards in transfer learning research.
Abstract
We pursue transfer learning to improve classifier accuracy on a target task with few labeled examples available for training. Recent work suggests that using a source task to learn a prior distribution over neural net weights, not just an initialization, can boost target task performance. In this study, we carefully compare transfer learning with and without source task informed priors across 5 datasets. We find that standard transfer learning informed only by an initialization performs far better than reported in previous comparisons. The relative gains of methods using informative priors over standard transfer learning vary in magnitude across datasets. For the scenario of 5 to 300 examples per class, we find negative or negligible gains on 2 datasets, modest gains (1.5 to 3 points of accuracy) on 2 other datasets, and substantial gains (>8 points) on one dataset. Among methods using informative priors, we find that an isotropic covariance appears competitive with a learned low-rank covariance matrix while being substantially simpler to understand and tune. Further analysis suggests that the mechanistic justification for informed priors (hypothesized improved alignment between train and test loss landscapes) is not consistently supported, due to high variability in empirical landscapes. We release code to allow independent reproduction of all experiments.
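
To make the distinction between an initialization and an informative prior concrete, the following minimal sketch contrasts two MAP objectives that differ only in where an isotropic Gaussian prior is centered. It is an illustration under stated assumptions, not the released code: the function names, the toy weight vectors, the fixed data loss, and the choice of a zero-centered prior for the standard baseline are all hypothetical.

```python
import math


def neg_log_isotropic_gaussian(weights, prior_mean, prior_variance):
    """Negative log density of N(prior_mean, prior_variance * I) at `weights`."""
    if len(weights) != len(prior_mean):
        raise ValueError("weights and prior_mean must have the same length")
    if prior_variance <= 0:
        raise ValueError("prior_variance must be positive")
    sq_dist = sum((w - m) ** 2 for w, m in zip(weights, prior_mean))
    log_norm = 0.5 * len(weights) * math.log(2.0 * math.pi * prior_variance)
    return 0.5 * sq_dist / prior_variance + log_norm


def map_objective(data_nll, weights, prior_mean, prior_variance):
    """MAP objective: data negative log-likelihood plus negative log prior."""
    return data_nll + neg_log_isotropic_gaussian(weights, prior_mean, prior_variance)


# Hypothetical toy values, chosen only for illustration.
source_weights = [0.8, -1.2, 0.5]  # weights learned on the source task
candidate = [0.9, -1.0, 0.4]       # weights being evaluated on the target task
data_nll = 2.0                     # stand-in for the target-task loss at `candidate`

# Assumed baseline: source weights serve only as the starting point of
# optimization, while the prior (weight decay) pulls toward zero.
zero_centered = map_objective(data_nll, candidate, [0.0, 0.0, 0.0], 1.0)

# Informative isotropic prior: the prior mean is the source weights, so the
# penalty pulls toward the source solution throughout training.
source_centered = map_objective(data_nll, candidate, source_weights, 1.0)
```

In this sketch the only tunable quantity of the informative prior is `prior_variance`, which is one reason an isotropic covariance is simpler to understand and tune than a learned low-rank covariance matrix.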
