Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers

Gabriele Prato, Simon Guiroy, Ethan Caballero, Irina Rish, Sarath Chandar

TL;DR

This study investigates how scaling training data and the number of training classes affect the few-shot generalization of pre-trained image classifiers to unseen classes and distributions. By evaluating multiple architectures and three evaluation paradigms (fine-tuning, prototypical networks, matching networks) on ten Meta-Dataset datasets, the authors demonstrate that few-shot performance follows power-law trends with respect to both data size and class count. A key finding is that few-shot performance on new classes often converges faster than standard in-distribution accuracy, underscoring scaling laws as a valuable lens for out-of-distribution generalization. The results provide actionable insights for selecting scaling strategies and contribute to the empirical understanding of how scale influences generalization in vision tasks.

Abstract

The empirical science of neural scaling laws is a rapidly growing area of significant importance to the future of machine learning, particularly in light of recent breakthroughs achieved by large-scale pre-trained models such as GPT-3, CLIP and DALL-E. Accurately predicting neural network performance as resources such as data, compute and model size increase provides a more comprehensive evaluation of different approaches across multiple scales than traditional point-wise comparisons of fixed-size models on fixed-size benchmarks and, most importantly, allows the field to focus on the best-scaling, and thus most promising, approaches. In this work, we consider the challenging problem of few-shot learning in image classification, especially when the target data distribution in the few-shot phase differs from the source (training) data distribution in the sense that it includes new image classes not encountered during training. Our main goal is to investigate how the amount of pre-training data affects the few-shot generalization performance of standard image classifiers. Our key observations are that (1) such performance improvements are well approximated by power laws (linear log-log plots) as the training set size increases, (2) this holds whether the target data comes from the same domain as the training data or from a different one (i.e., new classes), and (3) few-shot performance on new classes converges at a faster rate than standard classification performance on previously seen classes. Our findings shed new light on the relationship between scale and generalization.
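
To make the "linear log-log plots" observation concrete, the sketch below fits a pure power law, error ≈ a · n^(-b), by ordinary least squares on log-transformed data. This is only an illustration under stated assumptions: the functional form actually used in the paper is not shown in this section, the name `fit_power_law` is hypothetical, and the data points are synthetic rather than taken from the paper's results.

```python
import numpy as np


def fit_power_law(train_sizes, errors):
    """Fit errors ~= a * n**(-b) via a straight-line fit in log-log space.

    Assumes a pure power law with no irreducible-error offset; this is an
    illustrative choice, not necessarily the form used in the paper.
    """
    log_n = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # log(error) = log(a) - b * log(n), so the slope is -b.
    slope, intercept = np.polyfit(log_n, log_e, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic example (NOT data from the paper): error = 2.0 * n**(-0.25).
sizes = np.array([1e3, 1e4, 1e5, 1e6])
errors = 2.0 * sizes ** -0.25

a, b = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A larger fitted exponent `b` means the error falls faster as the training set grows, which is one way to compare how quickly two curves (for example, few-shot and in-distribution error) improve with scale.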

Paper Structure

This paper contains 17 sections, 1 equation, 17 figures, 1 table.

Figures (17)

  • Figure 1: ImageNet one-shot performance averaged over all target datasets: Aircraft, Bird, COCO, Describable Texture, Flower, Fungi, Omniglot, Quickdraw and Traffic Sign. (Left) Fine-tuning performance, (Center) Matching Network performance and (Right) Prototypical Network performance. The one-shot performance scales with the training set size following simple power laws.
  • Figure 2: Comparison of the standard classification performance on classes seen during training (in-distribution performance) and the few-shot performance. Standard classification performance is on the ImageNet test set for classes seen during training. Few-shot performance is the one-shot performance averaged over all target datasets: Aircraft, Bird, COCO, Describable Texture, Flower, Fungi, Omniglot, Quickdraw and Traffic Sign.
  • Figure 3: Few-shot performance for multiple train-target pairs of natural image datasets. For each plot, the train dataset is written on the left and the target dataset on top. X-axis is the percentage of the total training data and y-axis is the 5-way 5-shot accuracy. For 5-way 1-shot accuracy, see the full results in the appendix. Both 5-way 5-shot and 1-shot follow similar trends.
  • Figure 4: (Top row) Models trained on the dataset marked on top of each plot and evaluated on Omniglot. (Bottom row) Models trained on Omniglot and evaluated on the dataset marked on top of each respective column. X-axis is the percentage of the total training data and y-axis is the 5-way 5-shot accuracy. For 5-way 1-shot accuracy, see the full results in the appendix. Both 5-way 5-shot and 1-shot follow similar trends.
  • Figure 5: Results of scaling the number of training classes for various train-target pairs. For each plot, models are trained on the dataset marked on the left of each respective row and evaluated on the dataset marked on top of each respective column. X-axis is the percentage of the total number of training classes of each dataset and y-axis is the 5-way 5-shot accuracy. For 5-way 1-shot results, see the full results in the appendix. Both 5-way 5-shot and 1-shot follow similar trends.
  • ...and 12 more figures
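
The captions above report 5-way 5-shot accuracy. As a rough sketch of what one such evaluation episode could look like under the prototypical-network paradigm, the code below classifies each query embedding by its nearest class prototype (the mean of that class's support embeddings, using squared Euclidean distance). The function name `prototypical_accuracy` is hypothetical, and the embeddings are synthetic Gaussian clusters standing in for features from a pre-trained classifier; none of this reproduces the authors' pipeline.

```python
import numpy as np


def prototypical_accuracy(support, support_labels, query, query_labels, n_way):
    """Accuracy of nearest-prototype classification for one episode."""
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    preds = dists.argmin(axis=1)
    return float((preds == query_labels).mean())


# Synthetic 5-way 5-shot episode (assumed stand-in for real embeddings).
rng = np.random.default_rng(0)
n_way, k_shot, n_query, dim = 5, 5, 15, 16
centers = 10.0 * np.eye(n_way, dim)  # well-separated class centers

support_labels = np.repeat(np.arange(n_way), k_shot)
query_labels = np.repeat(np.arange(n_way), n_query)
support = centers[support_labels] + rng.normal(size=(n_way * k_shot, dim))
query = centers[query_labels] + rng.normal(size=(n_way * n_query, dim))

acc = prototypical_accuracy(support, support_labels, query, query_labels, n_way)
print(f"5-way 5-shot episode accuracy: {acc:.3f}")
```

In practice, accuracy would be averaged over many randomly sampled episodes per target dataset; a single episode is shown here only to keep the sketch short.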