Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding

Talfan Evans; Shreya Pathak; Hamza Merzic; Jonathan Schwarz; Ryutaro Tanno; Olivier J. Henaff

TL;DR

This work proposes a method that uses small, cheap proxy models to estimate "learnability" scores for datapoints. These scores are used to prioritize data for the training of much larger models, yielding a new state-of-the-art in several multimodal transfer tasks.

Abstract

Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow. Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples. Despite their appeal, these methods have yet to be widely adopted since no one algorithm has been shown to a) generalize across models and tasks b) scale to large datasets and c) yield overall FLOP savings when accounting for the overhead of data selection. In this work we propose a method which satisfies these three properties, leveraging small, cheap proxy models to estimate "learnability" scores for datapoints, which are used to prioritize data for the training of much larger models. As a result, our models require 46% and 51% fewer training updates and up to 25% less total computation to reach the same performance as uniformly trained visual classifiers on JFT and multimodal models on ALIGN. Finally, we find our data-prioritization scheme to be complementary with recent data-curation and learning objectives, yielding a new state-of-the-art in several multimodal transfer tasks.
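The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes that an example's "learnability" is the learner's loss minus the loss under a small reference model (in the spirit of mindermann2022prioritized), and that the highest-scoring examples in a candidate batch are kept for the update. The function names and the loss values are hypothetical and are not taken from the paper.

```python
def learnability_scores(learner_losses, reference_losses):
    """Assumed score: learner loss minus reference loss (higher = more learnable)."""
    if len(learner_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    return [l - r for l, r in zip(learner_losses, reference_losses)]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring examples, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical per-example losses for a candidate batch of six examples.
# Example 0: hard for the learner, easy for the reference -> learnable.
# Example 1: easy for both models -> little left to learn.
# Example 3: hard for both models (e.g. noisy data) -> not prioritized.
learner_losses = [2.0, 0.3, 1.5, 2.2, 0.9, 1.8]
reference_losses = [0.4, 0.2, 1.4, 2.1, 0.1, 0.5]

scores = learnability_scores(learner_losses, reference_losses)
selected = select_top_k(scores, k=3)
print(selected)
```

Under these assumptions, examples with high loss under both models and examples that are already easy for the learner both receive low scores, which is what distinguishes this criterion from prioritizing hard examples under the learner alone.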

Paper Structure (26 sections, 15 equations, 8 figures, 6 tables, 2 algorithms)

Introduction
Related Work
Methods
Data selection as prioritized replay
Statistics for data selection
Unlocking compute-positive training
Losses for canonical visual pre-training tasks
Experiments
Evaluating loss-based scoring heuristics in the large-data regime
Generalising data-selection policies across scale
Generalising neural scaling laws to the active-learning setting
Training the reference model in parallel
ActiveCLIP: active multimodal learning
Policy generalization across tasks
Comparison to prior multimodal art
...and 11 more sections

Figures (8)

Figure 1: Active learning accelerates large-scale visual understanding. For large-scale classification and multimodal learning tasks, prioritised training on data selected using our active selection methods ClassAct (left) and ActiveCLIP (right) requires significantly fewer updates to reach the final performance of uniform training.
Figure 2: Amortizing the cost of data selection. Drawn to scale: length of bars indicates number of FLOPs required to reach the accuracy of a ViT-L trained with uniform sampling (“ViT-L Uniform Sampling”, see Figure \ref{fig:fig_3_generalisation}). Expensive model policies (e.g. a ViT-B scores data for the ViT-L learner, or 'B $\rightarrow$ L') produce large learner speedups, at the expense of the additional FLOPs associated with data selection. This overhead can be reduced by deriving the data-selection policies from smaller models (e.g. ViT-S, ViT-Ti or ViT-Mu score data for the ViT-L learner), at the expense of marginal decreases in the learner speedup. Costs could be additionally amortized by using off-the-shelf reference models, removing the need to train from scratch (yellow). Since the reference model is fixed throughout training, scores can be assigned once to a 'foundation dataset' and amortized across many training runs (lime green; sorscher2022beyond, mindermann2022prioritized). Since the online model is independent of the learner model and generalizes across scale, data selection policies can also be distilled as a fixed ordering of a given dataset (a 'foundation curriculum').
Figure 3: Evaluation of loss-based data-selection criteria for large-scale classification. We train a ViT-B on JFT-300M with different data-selection policies. Prioritising hard data under the learner (green curve) produced marginal gains over the uniform sampling baseline. Prioritizing data using both learnability (blue curve, mindermann2022prioritized) and easy reference prioritization (red curve, hessel2021clipscore) produced significant speedups and performance gains.
Figure 4: Generalization of data-selection policies across model scales. Left: We train a ViT-L for 3 epochs on JFT using uniform sampling (grey) or prioritized data sampling using example learnability (blue) or low-loss under the reference model (red). Example scores are computed using ViT-B actors (dark), or cheaper ViT-S or ViT-Tiny models (light). While both example learnability and "easy reference" yield good speedups with expensive actors, learnability criteria are much more robust to approximate scoring. Top right: Learner (ViT-L) speedup is computed as the fraction of learner iterations saved in order to attain the baseline's top performance. Actor overhead is computed as the additional computation in FLOPs required to score examples with a particular actor architecture (varying from ViT-Mu to ViT-L, see Appendix Table \ref{tab:small_vits}). Example learnability yields robust learner speedups across actor scales, "easy reference" scoring does not. Lower right: total compute efficiency is calculated as a product of learner efficiency and actor overhead, indicating the amount of computation required to reach baseline performance. Approximate actors (i.e. ViT-S or smaller) computing example learnability enable total compute speedups, other schemes do not.
Figure 5: Scaling laws for active learning. We trained a baseline ViT-L over a range of compute budgets (for which ViT-L is compute optimal, see Zhai et al., 2021). We also trained the same ViT-L with both ViT-Ti and ViT-S reference policies, pre-trained for the same number of epochs. Left: Small model policies produce robust savings in learner compute. Right: When accounting for total compute (learner + actor training and data scoring), small model policies in all compute budgets produce FLOP savings over training with uniform samples. These scaling laws generalize those measured empirically in the uniform sampling setting (zhai2022scaling) to the case of non-uniform data selection.
...and 3 more figures
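The compute accounting described in the Figure 4 caption (learner speedup, actor overhead, and their product) can be sketched as follows. This is one possible reading of the caption, not the authors' exact formula: it assumes the actor overhead is expressed as a fraction of the learner's own FLOPs, so that total compute relative to uniform training is (1 - speedup) × (1 + overhead). All numbers are hypothetical.

```python
def learner_speedup(baseline_steps, active_steps):
    """Fraction of learner iterations saved to reach the baseline's top performance."""
    return 1.0 - active_steps / baseline_steps


def relative_total_compute(speedup, actor_overhead):
    """Total FLOPs relative to uniform training (assumed form).

    actor_overhead is the extra scoring cost as a fraction of learner FLOPs
    (an assumption of this sketch). A result below 1.0 means a net saving.
    """
    return (1.0 - speedup) * (1.0 + actor_overhead)


# Hypothetical run: the prioritized learner needs 600 steps instead of 1000.
speedup = learner_speedup(baseline_steps=1000, active_steps=600)

# A cheap actor (25% overhead) versus an expensive one (100% overhead).
cheap = relative_total_compute(speedup, actor_overhead=0.25)
expensive = relative_total_compute(speedup, actor_overhead=1.0)
print(speedup, cheap, expensive)
```

With these made-up values the cheap actor gives a net saving and the expensive one does not, which mirrors the caption's qualitative point that only approximate actors enable total compute speedups.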
