A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints
Michael Munn, Susan Wei
TL;DR
The paper introduces downstream free energy, a Bayesian model selection criterion for identifying the pretrained checkpoints most adaptable to downstream tasks. It shows that downstream free energy can be bounded by pretraining free energy, which can therefore serve as a proxy and enables checkpoint selection without access to downstream data under distribution shift. The authors develop a localized WBIC-based estimator for pretraining free energy and establish theoretical connections between the free energies and downstream performance. Experiments on CIFAR-FS and mini-ImageNet show that lower pretraining free energy correlates with stronger transfer and few-shot accuracy. The framework offers a principled approach to checkpoint selection that emphasizes adaptability and generalization, with practical implications for large foundation models where downstream data may be scarce.
Abstract
Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint's adaptability by measuring the concentration of nearby favorable parameters for the downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the criterion reliably correlates with improved finetuning performance, offering a principled approach to predicting model adaptability.
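To make the idea of a localized WBIC-based free energy estimate concrete, the following is a minimal sketch and not the authors' implementation. It relies on the standard WBIC identity, in which the free energy is approximated by the average of the total loss n * L_n(w) under a posterior tempered at inverse temperature 1 / log(n). Everything else is an assumption made for illustration: the toy linear-regression model, the Gaussian localization strength `gamma`, the full-batch Langevin sampler, and the function name `localized_wbic`.

```python
import numpy as np


def localized_wbic(w_star, X, y, gamma=100.0, step=1e-4,
                   n_steps=2000, burn_in=500, seed=0):
    """Hypothetical sketch: estimate the local free energy around w_star.

    Samples from a tempered posterior localized at w_star, with potential
        U(w) = beta * n * L_n(w) + (gamma / 2) * ||w - w_star||^2,
    and returns the sample average of n * L_n(w).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta = 1.0 / np.log(n)  # WBIC inverse temperature

    def total_loss(w):
        # n * L_n(w): Gaussian negative log-likelihood (unit variance), up to a constant
        r = X @ w - y
        return 0.5 * float(r @ r)

    def grad_total_loss(w):
        return X.T @ (X @ w - y)

    w = w_star.copy()
    draws = []
    for t in range(n_steps):
        # Gradient of U(w): tempered loss plus the localization term (an assumption)
        grad_u = beta * grad_total_loss(w) + gamma * (w - w_star)
        # Unadjusted Langevin step targeting exp(-U(w))
        w = w - 0.5 * step * grad_u + np.sqrt(step) * rng.standard_normal(w.shape)
        if t >= burn_in:
            draws.append(total_loss(w))
    return float(np.mean(draws))


# Toy usage: the least-squares solution stands in for a pretraining checkpoint.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
estimate = localized_wbic(w_star, X, y)
```

In this sketch, comparing `estimate` across several candidate checkpoints (each passed as `w_star`) would rank them by estimated local free energy, with lower values preferred. A practical version for neural networks would likely replace the full-batch gradient with minibatch gradients, which is not shown here.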
