Deep Ensembles for Low-Data Transfer Learning

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, André Susano Pinto, Daniel Keysers, Neil Houlsby

TL;DR

The paper tackles low-data transfer learning by constructing ensembles from a large pool of pre-trained models, drawing diversity from upstream pre-training. It introduces a practical algorithm that ranks pre-trained models by leave-one-out $k$NN accuracy, fine-tunes a subset with a small hyperparameter sweep, and greedily builds an ensemble to minimise validation cross-entropy. On 19 VTAB tasks, the proposed upstream-diversity ensembles achieve state-of-the-art performance with substantially lower inference budgets and exhibit improved robustness to distribution shift. The work demonstrates that leveraging upstream pre-training diversity, coupled with cheap model selection, can outperform traditional downstream diversity methods and scale across thousands of pre-trained candidates.
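
The ranking step can be illustrated with a minimal sketch of leave-one-out $k$NN accuracy. It is not the authors' implementation: the function names `loo_knn_accuracy` and `rank_models`, the squared Euclidean distance, the default `k=1`, and the dictionary of per-model embeddings are all assumptions made for illustration.

```python
from collections import Counter


def loo_knn_accuracy(features, labels, k=1):
    """Leave-one-out kNN accuracy: classify each point from its k nearest *other* points.

    Assumption: squared Euclidean distance and a plain majority vote.
    """
    n = len(features)
    correct = 0
    for i in range(n):
        # Distance from point i to every other point (point i itself is held out).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / n


def rank_models(embeddings_by_model, labels, k=1):
    """Return model names sorted from highest to lowest leave-one-out kNN accuracy."""
    scores = {
        name: loo_knn_accuracy(feats, labels, k)
        for name, feats in embeddings_by_model.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings of the same four downstream images under two models.
labels = [0, 0, 1, 1]
embeddings_by_model = {
    "model_b": [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]],  # classes mixed
    "model_a": [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],  # classes separated
}
ranking = rank_models(embeddings_by_model, labels)
```

Because the score needs only frozen embeddings of the downstream training images and no gradient steps, a ranking of this kind stays cheap even for a large pool of candidates.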

Abstract

In the low-data regime, it is difficult to train good supervised models from scratch. Instead, practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.
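
The final step, greedily constructing an ensemble to minimise validation cross-entropy, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's exact procedure: the names `cross_entropy`, `average_probs`, and `greedy_ensemble` are hypothetical, members are combined by averaging predicted probabilities, each model is selected at most once, and the loop stops at `max_size` or as soon as no candidate lowers the loss.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def average_probs(members):
    """Average class probabilities over ensemble members, example by example."""
    n_models = len(members)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*per_example)]
        for per_example in zip(*members)
    ]


def greedy_ensemble(val_probs, labels, max_size=4):
    """Repeatedly add the candidate that most reduces validation cross-entropy.

    Assumptions: selection without replacement, stop when nothing improves.
    """
    chosen, best_loss = [], float("inf")
    remaining = set(val_probs)
    while remaining and len(chosen) < max_size:
        losses = {
            name: cross_entropy(
                average_probs([val_probs[m] for m in chosen + [name]]), labels
            )
            for name in sorted(remaining)
        }
        candidate = min(losses, key=losses.get)
        if losses[candidate] >= best_loss:
            break  # no remaining model improves the ensemble
        chosen.append(candidate)
        best_loss = losses[candidate]
        remaining.remove(candidate)
    return chosen, best_loss


# Hypothetical validation predictions from three fine-tuned models (two examples, two classes).
val_labels = [0, 1]
val_probs = {
    "a": [[0.99, 0.01], [0.70, 0.30]],  # confident on example 0, wrong on example 1
    "b": [[0.30, 0.70], [0.02, 0.98]],  # the complementary pattern
    "c": [[0.10, 0.90], [0.90, 0.10]],  # wrong on both
}
chosen, loss = greedy_ensemble(val_probs, val_labels)
```

In this toy setup the two models with complementary errors are selected and the third is rejected, since adding it would raise the validation loss. That is the sense in which diversity among candidates helps the ensemble.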

Paper Structure

This paper contains 35 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Overview of the different ways of constructing diverse ensembles studied in this work. We propose an algorithm that exploits diversity in a large pool of pre-trained models, by using leave-one-out $k$-nearest-neighbour ($k$NN) accuracy to select a subset to form the ensemble.
  • Figure 2: Inference cost vs. VTAB1K performance. State-of-the-art generalist models of different scales are compared against ensembles with varying inference budgets.
  • Figure 3: Effect of fine-tuning budget on ensemble VTAB1K performance.
  • Figure 4: Expert ensembles retain higher accuracy under domain shift. Aside from the first bar, which shows test accuracy, all other bars correspond to some form of induced distribution shift, either artificially or otherwise. In all but one, we get significant boosts in accuracy compared to the HyperEnsembles.
  • Figure 5: VTAB1K validation accuracy of HyperEnsembles trained from scratch and HyperEnsembles trained from a generalist pre-trained model (JFT-R50).
  • ...and 8 more figures