On Data Scaling in Masked Image Modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu

TL;DR

This work systematically interrogates whether masked image modeling scales with data and model size. Using SimMIM with SwinV2 encoders across a wide range of data fractions, training lengths, and model sizes, the authors show that MIM benefits from more data only when training length is sufficient, and that very large models can overfit on limited data. They demonstrate a strong link between pre-training validation loss and downstream fine-tuning performance, suggesting validation loss as a practical proxy for evaluating pre-trained representations. The findings establish that MIM is both model- and data-scalable, and provide actionable guidance for scaling MIM pre-training and for efficient model selection without excessive downstream evaluation.

Abstract

An important goal of self-supervised learning is to enable model pre-training to benefit from almost unlimited data. However, one method that has recently become popular, namely masked image modeling (MIM), is suspected to be unable to benefit from larger data. In this work, we break this misconception through extensive experiments, with data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K iterations to 500K iterations. Our study reveals that: (i) Masked image modeling also demands larger data. We observed that very large models overfit when trained on relatively small data; (ii) The length of training matters. Large models trained with masked image modeling can benefit from more data with longer training; (iii) The validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks. This observation allows us to evaluate pre-trained models in advance, without costly trial-and-error assessments on downstream tasks. We hope that our findings will advance the understanding of masked image modeling in terms of scaling ability.
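Finding (iii) suggests a simple workflow: record the pre-training validation loss for each candidate run, check how strongly it tracks downstream accuracy on the few runs that have already been fine-tuned, and then rank the remaining runs by validation loss alone. The sketch below illustrates that idea only. The run names and all numbers are made-up placeholders rather than results from the paper, and the helpers `pearson` and `pick_by_val_loss` are hypothetical names introduced here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def pick_by_val_loss(runs):
    """Return the run with the lowest pre-training validation loss."""
    return min(runs, key=lambda run: run["val_loss"])


# Placeholder values for illustration only; NOT numbers from the paper.
runs = [
    {"name": "run_a", "val_loss": 0.52, "finetune_acc": 82.0},
    {"name": "run_b", "val_loss": 0.49, "finetune_acc": 83.1},
    {"name": "run_c", "val_loss": 0.47, "finetune_acc": 83.9},
    {"name": "run_d", "val_loss": 0.55, "finetune_acc": 81.2},
]

# A strongly negative value means lower validation loss goes with higher accuracy.
r = pearson(
    [run["val_loss"] for run in runs],
    [run["finetune_acc"] for run in runs],
)
best = pick_by_val_loss(runs)

print(f"correlation(val_loss, finetune_acc) = {r:.3f}")
print(f"selected run: {best['name']}")
```

If the correlation on the already fine-tuned runs is weak, ranking by validation loss would not be a safe shortcut, so this check is worth repeating for each downstream task of interest.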

Paper Structure (27 sections, 8 figures, 14 tables)

Figures (8)

  • Figure 1: The curves of training loss, validation loss of pre-training, and fine-tuning accuracy on ImageNet-1K of different model sizes, data sizes and training lengths, w.r.t. the relative training cost. We set the training cost of SwinV2-S for 125K iterations as the value of 1. Bigger circles indicate larger models. Best viewed in color.
  • Figure 2: Relationship among the pre-training training loss, the pre-training validation loss, and fine-tuning performance on ImageNet-1K (measured by top-1 accuracy), w.r.t. the training length. Best viewed in color.
  • Figure 3: The curves of performances on COCO object detection (a), COCO instance segmentation (b), iNaturalist-18 (c), and ADE20K semantic segmentation (d) w.r.t. the relative training cost. Note that the training cost indicates the pre-training cost. We set the training cost of SwinV2-S for 125K iterations as 1. Bigger circles indicate larger models. Best viewed in color.
  • Figure 4: The correlations between pre-training losses (training and validation losses) and the fine-tuning performances. (a) ImageNet-1K image classification; (b) iNat 2018 fine-grained classification; (c) COCO object detection; (d) COCO instance segmentation; (e) ADE20K semantic segmentation. Pre-training losses are highly correlated with fine-tuning performance on all tasks. Red circles are the overfitting models and green circles are non-overfitting models. Best viewed in color.
  • Figure 5: We visualize the reconstruction results of an overfitting model (SwinV2-L pre-trained on ImageNet-1K(10%)) and a non-overfitting model (SwinV2-L pre-trained on ImageNet-1K(100%)). (a) shows the reconstruction results on training images from the ImageNet-1K(10%) dataset, which are contained in the training sets of both models. (b) shows the reconstruction results on validation images from the ImageNet-1K validation set. Each group contains 4 images; from left to right: the original image, the corrupted image, the reconstruction by the overfitting model, and the reconstruction by the non-overfitting model.
  • ...and 3 more figures