
ConvNets Match Vision Transformers at Scale

Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De

TL;DR

After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets, and the strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Abstract

Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held-out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Paper Structure (5 sections, 3 figures)

This paper contains 5 sections and 3 figures.

Figures (3)

  • Figure 1: ImageNet Top-1 error, after fine-tuning pre-trained NFNet models for 50 epochs. Both axes are log-scaled. Performance improves consistently as the compute used during pre-training increases. Our largest model (F7+) achieves comparable performance to that reported for pre-trained ViTs with a similar compute budget (Alabdulmohsin et al., 2023; Zhai et al., 2022). The performance of this model improved further when fine-tuned with repeated augmentation (RA) (Hoffer et al., 2019).
  • Figure 2: Held-out loss of NFNets on JFT-4B, plotted against the compute used during training. Both axes are log-scaled, and each curve denotes a different model trained for a range of epoch budgets. We observe a linear trend on these log-log axes, matching the scaling laws observed for language modelling.
  • Figure 3: The optimal learning rate behaves predictably and is easy to tune. All models show similar optimal learning rates $\alpha \sim 1.6$ when the epoch budget is small. The learning rate falls slowly as model size and epoch budget increase.
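
A linear trend on log-log axes, as described for Figure 2, is what a power law of the form loss ≈ a · compute^(−b) looks like, since taking logs gives log(loss) = log(a) − b · log(compute). The sketch below shows one way such a trend could be fitted, using ordinary least squares in log space. It is illustrative only: the function name `fit_power_law`, the coefficients, and the loss values are made up for this example and are not the paper's measurements or the authors' fitting procedure. Only the compute range (0.4k to 110k core hours) is borrowed from the abstract, and the pure power-law form (with no irreducible-loss offset) is an assumption.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by least squares in log-log space.

    Returns the pair (a, b). Assumes all inputs are strictly positive.
    """
    log_c = np.log(np.asarray(compute, dtype=float))
    log_l = np.log(np.asarray(loss, dtype=float))
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, intercept = np.polyfit(log_c, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Hypothetical compute budgets spanning the range quoted in the abstract.
compute_hours = np.array([400.0, 1600.0, 6400.0, 25600.0, 110000.0])

# Synthetic, noise-free losses generated from made-up coefficients
# (NOT values from the paper), so the fit should recover them.
true_a, true_b = 5.0, 0.1
synthetic_loss = true_a * compute_hours ** (-true_b)

a, b = fit_power_law(compute_hours, synthetic_loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points would scatter around the fitted line, and a noticeable curvature on log-log axes would suggest that a pure power law is not an adequate description over that compute range.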