To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning

Ildus Sadrtdinov, Dmitrii Pozdeev, Dmitry Vetrov, Ekaterina Lobacheva

TL;DR

The paper addresses how to build high-quality ensembles in transfer learning when only a single pre-trained checkpoint is available. It analyzes local and semi-local ensembling methods and finds that exploring the pre-train basin with Snapshot Ensembles (SSE) improves ensemble quality, whereas leaving the basin forfeits the benefits of transfer learning. To resolve this, it introduces StarSSE, a parallel, star-shaped variant of SSE that preserves the transfer advantages while yielding diverse, high-quality models; ensembles trained with StarSSE also produce strong model soups. Across medium- and large-scale tasks, StarSSE outperforms standard SSE and Local deep ensembles (Local DE) both as ensembles and as soups, including under distribution shift. The work advances practical, compute-efficient ensembling for transfer learning and offers guidance on exploiting loss-landscape structure when fine-tuning from a single checkpoint.

Abstract

Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
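
The abstract distinguishes two ways of combining fine-tuned models: an ensemble, which averages predictions, and a uniform model soup, which averages weights into a single model. The sketch below illustrates that difference only. It is not the authors' implementation: the toy linear classifier, the helper names (`predict`, `ensemble_predict`, `uniform_soup`), and the randomly perturbed "members" are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(weights, x):
    # Toy linear classifier; a hypothetical stand-in for a fine-tuned network.
    return softmax(x @ weights["W"] + weights["b"])

def ensemble_predict(members, x):
    # Ensemble: average the predicted class probabilities of all members.
    return np.mean([predict(w, x) for w in members], axis=0)

def uniform_soup(members):
    # Uniform model soup: average the weights, yielding one model.
    return {k: np.mean([w[k] for w in members], axis=0) for k in members[0]}

rng = np.random.default_rng(0)

# Assumption: a shared starting point plays the role of the pre-trained checkpoint,
# and small random perturbations play the role of separate fine-tuning runs.
base = {"W": rng.normal(size=(4, 3)), "b": np.zeros(3)}
members = [
    {k: v + 0.1 * rng.normal(size=v.shape) for k, v in base.items()}
    for _ in range(5)
]

x = rng.normal(size=(8, 4))            # 8 toy inputs with 4 features each
ens_probs = ensemble_predict(members, x)  # needs 5 forward passes
soup = uniform_soup(members)
soup_probs = predict(soup, x)             # needs 1 forward pass
```

The ensemble costs one forward pass per member at inference time, while the soup costs a single pass. Weight averaging is only expected to work well when the members lie close together in weight space, which is why soups are discussed in connection with staying in the pre-train basin.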

Paper Structure

This paper contains 30 sections, 1 equation, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Results of SSEs (top row) and StarSSEs (bottom row) of different sizes on CIFAR-100 for different types of pre-training and varying values of cycle hyperparameters (maximum learning rate and number of epochs). Local and Global DEs are shown for comparison with gray dotted lines.
  • Figure 2: Linear connectivity analysis of Local DEs, SSEs (left plots), and StarSSEs (right plots) on CIFAR-100 with self-supervised pre-training. We show train and test accuracy along line segments between two random networks in Local DEs, between the first and the last (5th) network in three differently behaving SSEs, and between the first and any other subsequent network in three differently behaving StarSSEs. Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
  • Figure 3: Train and test accuracy of individual models from three differently behaving SSEs (left plots) and StarSSEs (right plots) on CIFAR-100 with self-supervised pre-training (with a single fine-tuned model for comparison). Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
  • Figure 4: Results of ensembles (left plots) and model soups (right plots) of different sizes on ID (CIFAR-100 test set, top row) and OOD (CIFAR-100C, bottom row) for SSE and StarSSE with self-supervised pre-training. For OOD, we measure the average accuracy over all possible corruptions and severity values. Standard deviations are calculated over different pre-training checkpoints and/or fine-tuning random seeds.
  • Figure 5: Results of ensembles (left plots) and model soups (right plots) of CLIP models on ImageNet. Three differently behaving SSE and StarSSE experiments are shown with Local and Global DEs for comparison. Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
  • ...and 13 more figures
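
The linear connectivity analysis described in the Figure 2 caption measures accuracy along the straight line segment between two networks in weight space. A minimal sketch of that procedure follows, assuming a toy linear classifier and synthetic data. The helper names (`interpolate`, `accuracy`, `accuracy_along_segment`) and the choice of 11 evaluation points are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def interpolate(w_a, w_b, t):
    # Point on the segment between two weight sets: (1 - t) * w_a + t * w_b.
    return {k: (1 - t) * w_a[k] + t * w_b[k] for k in w_a}

def accuracy(weights, x, y):
    # Accuracy of a toy linear classifier (hypothetical stand-in for a network).
    logits = x @ weights["W"] + weights["b"]
    return float((logits.argmax(axis=1) == y).mean())

def accuracy_along_segment(w_a, w_b, x, y, num_points=11):
    # Evaluate accuracy at evenly spaced points t in [0, 1] along the segment.
    ts = np.linspace(0.0, 1.0, num_points)
    accs = np.array([accuracy(interpolate(w_a, w_b, t), x, y) for t in ts])
    return ts, accs

rng = np.random.default_rng(0)

# Assumption: two random weight sets stand in for two fine-tuned networks.
w_a = {"W": rng.normal(size=(4, 3)), "b": rng.normal(size=3)}
w_b = {"W": rng.normal(size=(4, 3)), "b": rng.normal(size=3)}

x = rng.normal(size=(200, 4))
# Synthetic labels: the predictions of w_a, so the t = 0 endpoint is perfect.
y = (x @ w_a["W"] + w_a["b"]).argmax(axis=1)

ts, accs = accuracy_along_segment(w_a, w_b, x, y)
```

In this kind of analysis, a pronounced drop in accuracy somewhere along the path is usually read as a barrier between the endpoints, suggesting they lie in different basins, whereas a flat profile suggests they share one.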