The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

TL;DR

To reduce the heavy reliance of foundation vision models on large supervised datasets, this paper introduces an MAE-based pre-pretraining stage that initializes vision transformers before standard weakly supervised pretraining (WSP). The main finding is that MAE scales with both model size and training data size, and that the combined MAE→WSP recipe improves convergence and downstream transfer across image classification, video recognition, object detection, and zero-/low-shot settings. The work reports results on 10 tasks, including state-of-the-art outcomes on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot Food-101 (96.2%). Overall, the approach is simple and scalable, combines self-supervised and weakly supervised signals, and indicates that initialization plays a significant role even for web-scale pretraining with billions of images.

Abstract

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
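
The two-stage recipe described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard MAE setup of random patch masking with a 75% mask ratio and 16px patches (neither value is stated in this summary), and it represents model weights as a plain dictionary. The function names `random_patch_mask` and `init_from_pre_pretraining`, and the parameter names such as `encoder.w`, are hypothetical and chosen for illustration only.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, MAE-style.

    mask_ratio=0.75 is an assumption taken from the usual MAE setup,
    not a value given in this summary.
    """
    rng = random.Random(seed)
    perm = list(range(num_patches))
    rng.shuffle(perm)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(perm[:num_keep])   # patches the encoder would see
    masked = sorted(perm[num_keep:])    # patches to be reconstructed
    return visible, masked


def init_from_pre_pretraining(wsp_params, mae_params):
    """Initialize the WSP model from MAE weights where names match.

    Parameters that exist only in the MAE model (for example a decoder)
    are dropped; parameters that exist only in the WSP model (for example
    a classification head) keep their fresh initialization.
    """
    out = dict(wsp_params)
    for name, value in mae_params.items():
        if name in out:
            out[name] = value
    return out


# Stage 1 (pre-pretraining): a 224px image with assumed 16px patches
# gives 14 * 14 = 196 patches, of which a quarter stay visible.
visible, masked = random_patch_mask(num_patches=196)

# Stage 2 (WSP): start from the MAE weights instead of a random init.
wsp_init = init_from_pre_pretraining(
    wsp_params={"encoder.w": 0.0, "head.w": 0.0},
    mae_params={"encoder.w": 1.0, "decoder.w": 2.0},
)
```

The point of the sketch is the handoff: only the weights shared between the two stages carry over, so the rest of the WSP training loop needs no changes.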

Paper Structure (12 sections, 8 figures, 24 tables)

Figures (8)

  • Figure 1: MAE pre-pretraining improves performance. Transfer performance of a ViT-L architecture trained with self-supervised pretraining (MAE), weakly supervised pretraining on billions of images (WSP), and our pre-pretraining (MAE$\rightarrow$WSP) that initializes the model with MAE and then pretrains with WSP. Pre-pretraining consistently improves performance.
  • Figure 2: Scaling MAE with model and dataset size. We plot MAE's performance when pretrained on ImageNet-1k (IN1k) or Instagram-3B (IG-3B) and finetuned on downstream tasks. MAE scales to billion-parameter models using just IN1k pretraining. Larger models show improved scaling behavior when pretrained with the much larger IG-3B dataset. Tabulated results are in Appendix \ref{tab:mae_scaling_numbers}. IN1k and iNat18 results are finetuned at 224px resolution. For COCO and LVIS, the result for ViT-2B with MAE pretrained on IN1k is missing because training at that scale was unstable, and ViT-6.5B results are skipped due to compute limitations.
  • Figure 3: MAE pre-pretraining scales with model size. Across model sizes, MAE$\rightarrow$WSP outperforms a WSP-only model and shows strong scaling behavior. Most notably, a 2B MAE$\rightarrow$WSP model outperforms a 6.5B WSP model.
  • Figure 4: Varying the number of pre-pretraining epochs used to initialize the model for WSP pretraining. Pre-pretraining leads to improved convergence, providing higher performance with fewer WSP pretraining epochs.
  • Figure 5: MAE$\rightarrow$WSP is more FLOPs-efficient than WSP. Across a wide range of training FLOP profiles for a ViT-B, computed by varying WSP (and MAE) epochs, MAE$\rightarrow$WSP outperforms a baseline WSP-only model.
  • ...and 3 more figures