A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints

Michael Munn, Susan Wei

TL;DR

The paper introduces downstream free energy as a Bayesian model selection criterion to identify pretrained checkpoints most adaptable to downstream tasks. It shows how downstream free energy can be bounded by and proxied through pretraining free energy, enabling checkpoint selection without access to downstream data under distribution shift. The authors develop a localized WBIC-based estimator for pretraining free energy and provide theoretical connections between free energies and downstream performance, accompanied by empirical evidence from CIFAR-FS and mini-ImageNet demonstrating that lower pretraining free energy correlates with stronger transfer and few-shot accuracy. This framework offers a principled approach to checkpoint selection that emphasizes adaptability and generalization, with practical implications for large foundation models where downstream data may be scarce.
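As a rough illustration of what a localized WBIC-style estimate involves, the sketch below averages the summed negative log-likelihood $n L_n(w)$ over samples from a tempered posterior at inverse temperature $\beta = 1/\log n$ (Watanabe's WBIC temperature), confined to a neighborhood of a checkpoint. Everything specific here is an assumption for illustration and is not taken from the paper: the toy linear-regression model with unit noise variance, the Gaussian localizing term with strength `localization`, full-batch unadjusted Langevin sampling, and the names `summed_loss` and `local_wbic`.

```python
import numpy as np

def summed_loss(w, X, y):
    """n * L_n(w): summed negative log-likelihood (up to a constant)
    for linear regression with unit noise variance."""
    residual = X @ w - y
    return 0.5 * float(np.sum(residual ** 2))

def local_wbic(w_star, X, y, localization=100.0, step=1e-4,
               n_steps=2000, burn_in=500, seed=0):
    """Hypothetical localized WBIC-style estimate around w_star.

    Samples from exp(-beta * n L_n(w) - localization/2 * ||w - w_star||^2)
    with unadjusted Langevin dynamics and averages n L_n(w) over the samples.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = 1.0 / np.log(n)  # WBIC inverse temperature
    w = np.array(w_star, dtype=float)
    losses = []
    for t in range(n_steps):
        # Gradient of the tempered, localized potential.
        grad = beta * (X.T @ (X @ w - y)) + localization * (w - w_star)
        w = w - 0.5 * step * grad + np.sqrt(step) * rng.normal(size=w.shape)
        if t >= burn_in:
            losses.append(summed_loss(w, X, y))
    return float(np.mean(losses))

# Toy usage: treat the least-squares solution as the "checkpoint".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=200)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
estimate = local_wbic(w_star, X, y)
print(f"n*L_n(w*) = {summed_loss(w_star, X, y):.2f}, "
      f"local WBIC estimate = {estimate:.2f}")
```

In this toy setting the estimate exceeds $n L_n(w^*)$ because every sampled parameter has loss at least that of the minimizer; the size of the gap reflects how much low-loss volume surrounds the checkpoint. For a neural network one would presumably replace the full-batch gradient with minibatch gradients, which is a design choice the sketch does not cover.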

Abstract

Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability by measuring the concentration of nearby favorable parameters for the downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the criterion reliably correlates with improved finetuning performance, offering a principled approach to predicting model adaptability.

Paper Structure

This paper contains 32 sections, 1 theorem, 41 equations, 3 figures, and 1 table.

Key Result

Proposition 5.1

Let $w^*$ be a local minimum of $\mathrm{K}^0(w)$; i.e., $w^* \in U_0$, and let $\gamma$ be such that [...]. Further suppose $\lambda^1(w^*) \le \lambda^0(w^*)$. Define $M := \max_{(x,y) \sim r^0(x,y)} \frac{r^1(x,y)}{r^0(x,y)} < \infty$. Then [...], where $D = \int \log \frac{r^1(y|x)}{r^0(y|x)} \, r^1(x,y) \,dx \,dy$.
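To make the two quantities in the proposition concrete, the following sketch evaluates $M$ and $D$ for small discrete distributions, with the integral replaced by a sum. The function names, the table layout (rows index $x$, columns index $y$), and the numbers are illustrative assumptions; the sketch also assumes every $x$ has positive marginal mass under both distributions.

```python
import numpy as np

def density_ratio_bound(r0, r1):
    """M: the largest ratio r1(x,y) / r0(x,y) over the support of r0.
    Returns inf if r1 puts mass where r0 has none."""
    r0 = np.asarray(r0, dtype=float)
    r1 = np.asarray(r1, dtype=float)
    support = r0 > 0
    if np.any(r1[~support] > 0):
        return float("inf")
    return float(np.max(r1[support] / r0[support]))

def conditional_log_ratio(r0, r1):
    """D: sum over (x,y) of r1(x,y) * log( r1(y|x) / r0(y|x) )."""
    r0 = np.asarray(r0, dtype=float)
    r1 = np.asarray(r1, dtype=float)
    cond0 = r0 / r0.sum(axis=1, keepdims=True)  # r0(y|x)
    cond1 = r1 / r1.sum(axis=1, keepdims=True)  # r1(y|x)
    mask = r1 > 0  # terms with r1(x,y) = 0 contribute nothing
    return float(np.sum(r1[mask] * np.log(cond1[mask] / cond0[mask])))

# Covariate shift: the conditional r(y|x) is shared, only the marginal on x moves.
cond = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
px0 = np.array([0.5, 0.5])   # pretraining marginal on x
px1 = np.array([0.8, 0.2])   # downstream marginal on x
r0 = px0[:, None] * cond
r1 = px1[:, None] * cond

M = density_ratio_bound(r0, r1)
D = conditional_log_ratio(r0, r1)
print(M, D)
```

Under pure covariate shift the conditionals agree, so $D$ vanishes and $M$ reduces to the largest ratio of the marginals on $x$. When the conditionals differ, $D$ is the conditional Kullback-Leibler divergence of $r^1(y|x)$ from $r^0(y|x)$, averaged over the downstream marginal.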

Figures (3)

  • Figure 1: We plot pretraining free energy versus two types of transfer accuracy (top and bottom) for checkpoints at the end of pretraining. As expected, checkpoints with lower pretraining free energy, across various pretraining hyperparameters such as learning rate, batch size, and momentum, show higher transfer accuracy. The size of each icon represents the magnitude of the hyperparameter value; e.g., a larger triangle means higher momentum. The reported values are averaged over five random seeds. See the experiments section for details.
  • Figure 2: Model checkpoints with lower pretraining WBIC (second column) consistently result in better transfer accuracy, both when fine-tuning on the full downstream dataset (third column) and in the few-shot setting (fourth column). Lower pretraining WBIC correlates with better downstream performance for larger learning rates (top row), smaller batch sizes (middle row), and increased momentum (bottom row). Additional experiments on mini-ImageNet and a VGG model yield similar results; see the mini-ImageNet transfer learning figure and the appendix on additional mini-ImageNet experiments.
  • Figure 3: Model checkpoints with lower pretraining WBIC (second column) consistently result in better transfer accuracy, both when fine-tuning on the full downstream dataset (third column) and in the few-shot setting (fourth column). Lower pretraining WBIC correlates with better downstream performance for larger learning rates (top row), smaller batch sizes (middle row), and increased momentum (bottom row).

Theorems & Definitions (6)

  • Remark 4.1
  • Remark 4.2
  • Proposition 5.1
  • proof
  • Example 1: Covariate shift between pretraining and downstream distributions
  • Example 2: Nuisance parameter mismatch between pretrain and downstream distributions