ConvNets Match Vision Transformers at Scale

Samuel L. Smith; Andrew Brock; Leonard Berrada; Soham De

TL;DR

After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets, and the strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Abstract

Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held-out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Paper Structure (5 sections, 3 figures)

Introduction
Pre-trained NFNets obey scaling laws
Fine-tuned NFNets are competitive with Vision Transformers on ImageNet
Discussion
Acknowledgements

Figures (3)

Figure 1: ImageNet Top-1 error, after fine-tuning pre-trained NFNet models for 50 epochs. Both axes are log-scaled. Performance improves consistently as the compute used during pre-training increases. Our largest model (F7+) achieves comparable performance to that reported for pre-trained ViTs with a similar compute budget (Alabdulmohsin et al., 2023; Zhai et al., 2022). The performance of this model improved further when fine-tuned with repeated augmentation (RA) (Hoffer et al., 2019).
Figure 2: Held-out loss of NFNets on JFT-4B, plotted against the compute used during training. Both axes are log-scaled, and each curve denotes a different model trained for a range of epoch budgets. We observe a linear trend on these log-log axes, matching the scaling laws observed for language modelling.
Figure 3: The optimal learning rate behaves predictably and is easy to tune. All models show similar optimal learning rates $\alpha \sim 1.6$ when the epoch budget is small. The optimal learning rate falls slowly as model size and epoch budget increase.
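
A linear trend on log-log axes, as described for Figure 2, corresponds to a power law of the form loss ≈ a · compute^(−b). The sketch below shows one way such a trend could be fitted, using ordinary least squares on the logged values. The function name `fit_power_law`, the coefficients, and the data points are hypothetical and purely illustrative. They are not values from the paper, and this is not necessarily the fitting procedure the authors used.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) by least squares in log-log space.

    Returns (a, b). A straight line on log-log axes has slope -b and
    intercept log(a).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression: slope = cov(x, y) / var(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data only (NOT measurements from the paper): compute budgets
# spanning the 0.4k to 110k core-hour range mentioned in the abstract, with
# losses generated from an assumed power law (a = 5.0, b = 0.1).
compute_hours = [0.4e3, 1.6e3, 6.4e3, 25.6e3, 110e3]
assumed_a, assumed_b = 5.0, 0.1
held_out_loss = [assumed_a * c ** (-assumed_b) for c in compute_hours]

a, b = fit_power_law(compute_hours, held_out_loss)
print(f"fitted a = {a:.4f}, fitted b = {b:.4f}")
```

Because the synthetic losses are generated exactly from the assumed power law, the fit recovers the assumed coefficients up to floating-point error. Real measurements would scatter around the fitted line.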
