To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning

Ildus Sadrtdinov; Dmitrii Pozdeev; Dmitry Vetrov; Ekaterina Lobacheva

TL;DR

The paper addresses how to build high-quality ensembles in transfer learning when only a single pre-trained checkpoint is available. It analyzes local and semi-local ensembling methods and finds that exploring the pre-train basin more thoroughly with Snapshot Ensembles (SSE) improves the ensemble, whereas leaving the basin forfeits the benefits of transfer learning. To resolve this, it introduces StarSSE, a parallel, star-shaped variant of SSE that preserves those benefits while yielding diverse, high-quality models; it also shows that StarSSE models make strong uniform model soups. Across medium- and large-scale tasks, StarSSE consistently outperforms standard SSE and local deep ensembles (Local DE), both as an ensemble and as a soup, with gains that extend to robustness evaluations and larger-scale settings. The work advances practical, compute-efficient ensembling for transfer learning and offers guidance on exploiting loss-landscape structure when only one checkpoint is available.
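
The difference between sequential SSE and star-shaped StarSSE can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that every StarSSE cycle restarts from the first fine-tuned model (which is what the star shape and the Figure 2 comparison between the first and every other network suggest), and `run_cycle` and `toy_cycle` are hypothetical stand-ins for one cyclical-learning-rate training cycle.

```python
from typing import Callable, List

Weights = List[float]
# Hypothetical signature: (starting weights, cycle index) -> weights after one cycle.
Cycle = Callable[[Weights, int], Weights]


def sse(first_model: Weights, n_models: int, run_cycle: Cycle) -> List[Weights]:
    """Sequential chain: every cycle starts where the previous one ended."""
    models = [first_model]
    for k in range(1, n_models):
        models.append(run_cycle(models[-1], k))
    return models


def star_sse(first_model: Weights, n_models: int, run_cycle: Cycle) -> List[Weights]:
    """Star: every cycle starts from the same first model, so cycles are independent."""
    return [first_model] + [run_cycle(first_model, k) for k in range(1, n_models)]


def toy_cycle(weights: Weights, seed: int) -> Weights:
    # Stand-in for real training: shift each weight by a seed-dependent step.
    return [w + 1.0 + 0.1 * seed for w in weights]


def distance(a: Weights, b: Weights) -> float:
    """Euclidean distance between two weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


chain = sse([0.0], 4, toy_cycle)
star = star_sse([0.0], 4, toy_cycle)
chain_drift = distance(chain[0], chain[-1])  # steps accumulate along the chain
star_drift = distance(star[0], star[-1])     # only a single step away from the center
```

In this toy setting the chain drifts further from the first model with every cycle, while each star branch stays one cycle away from it. Because the star branches do not depend on each other, they could also be run in parallel.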

Abstract

Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
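
The abstract mentions two ways of combining fine-tuned models: an ensemble, which averages predictions, and a uniform model soup, which averages weights. The sketch below contrasts the two on a toy linear softmax classifier. The classifier and all function names are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]  # weights[c] is the weight vector of class c


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights: Matrix, x: List[float]) -> List[float]:
    """Class probabilities of a toy linear classifier."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def ensemble_predict(models: List[Matrix], x: List[float]) -> List[float]:
    """Ensemble: run every model, then average the predicted probabilities."""
    probs = [predict(m, x) for m in models]
    return [sum(p[c] for p in probs) / len(probs) for c in range(len(probs[0]))]


def uniform_soup(models: List[Matrix]) -> Matrix:
    """Uniform soup: average the weights with equal coefficients into one model."""
    n = len(models)
    return [
        [sum(m[c][j] for m in models) / n for j in range(len(models[0][c]))]
        for c in range(len(models[0]))
    ]


models = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 1.0]],
]
x = [1.0, 0.5]
ensemble_probs = ensemble_predict(models, x)   # one forward pass per member
soup = uniform_soup(models)
soup_probs = predict(soup, x)                  # a single forward pass
```

The practical trade-off is inference cost: the ensemble needs one forward pass per member, whereas the soup is a single model. Weight averaging is generally expected to work only when the averaged models lie in a connected low-loss region, which is why the basin structure discussed in the abstract matters for soups.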

Paper Structure (30 sections, 1 equation, 18 figures, 8 tables)

Introduction
Ensembling methods in regular training
Scope of the paper: transfer learning perspective
Experimental setup
Empirical analysis of ensembling in transfer learning
Local and global deep ensembles
Local and semi-local Snapshot Ensembles
Analysis of Snapshot Ensemble behavior
StarSSE: a better version of Snapshot Ensemble for transfer learning
Soups of SSE and StarSSE models
Robustness analysis
Large-scale experiments
Conclusion
Limitations and societal impact
Experimental setup details
...and 15 more sections

Figures (18)

Figure 1: Results of SSEs (top row) and StarSSEs (bottom row) of different sizes on CIFAR-100 for different types of pre-training and varying values of cycle hyperparameters (maximum learning rate and number of epochs). Local and Global DEs are shown for comparison with gray dotted lines.
Figure 2: Linear connectivity analysis of Local DEs, SSEs (left plots), and StarSSEs (right plots) on CIFAR-100 with self-supervised pre-training. We show train and test accuracy along line segments between two random networks in Local DEs, between the first and the last (5th) network in three differently behaving SSEs, and between the first and any subsequent network in three differently behaving StarSSEs. Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
Figure 3: Train and test accuracy of individual models from three differently behaving SSEs (left plots) and StarSSEs (right plots) on CIFAR-100 with self-supervised pre-training (with a single fine-tuned model for comparison). Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
Figure 4: Results of ensembles (left plots) and model soups (right plots) of different sizes on ID (CIFAR-100 test set, top row) and OOD (CIFAR-100C, bottom row) for SSE and StarSSE with self-supervised pre-training. For OOD, we measure the average accuracy over all possible corruptions and severity values. Standard deviations are calculated over different pre-training checkpoints and/or fine-tuning random seeds.
Figure 5: Results of ensembles (left plots) and model soups (right plots) of CLIP models on ImageNet. Three differently behaving SSE and StarSSE experiments are shown with Local and Global DEs for comparison. Hyperparameters for more local and more semi-local experiments are the same for SSE and StarSSE, while hyperparameters for the optimal experiments may differ.
...and 13 more figures
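
The linear connectivity analysis described in the Figure 2 caption evaluates models along the straight line segment between two sets of weights. A minimal sketch of that procedure is given below, assuming flattened weight vectors and a hypothetical `evaluate` callback that returns a score such as accuracy. The toy score used here is a placeholder, not a real network.

```python
from typing import Callable, List, Tuple

Weights = List[float]


def interpolate(w_a: Weights, w_b: Weights, alpha: float) -> Weights:
    """Point on the segment between w_a (alpha = 0) and w_b (alpha = 1)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


def score_along_segment(
    w_a: Weights,
    w_b: Weights,
    evaluate: Callable[[Weights], float],  # hypothetical: weights -> accuracy
    n_points: int = 5,
) -> List[Tuple[float, float]]:
    """Evaluate evenly spaced interpolated models, endpoints included."""
    alphas = [i / (n_points - 1) for i in range(n_points)]
    return [(alpha, evaluate(interpolate(w_a, w_b, alpha))) for alpha in alphas]


def toy_score(weights: Weights) -> float:
    # Placeholder score that peaks when every weight equals 1.
    return -sum((w - 1.0) ** 2 for w in weights)


path = score_along_segment([0.0, 0.0], [2.0, 2.0], toy_score)
```

With a real network, a pronounced dip in the score between the two endpoints would indicate a barrier, meaning the two models are not in the same linearly connected basin. A flat or rising curve would indicate that they are.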
