On Data Scaling in Masked Image Modeling
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu
TL;DR
This work systematically interrogates whether masked image modeling scales with data and model size. Using SimMIM with SwinV2 encoders across a wide range of data fractions, training lengths, and model sizes, the authors show that MIM benefits from more data only when training length is sufficient, and that very large models can overfit on limited data. They demonstrate a strong link between pre-training validation loss and downstream fine-tuning performance, suggesting validation loss as a practical proxy for evaluating pre-trained representations. The findings establish that MIM is both model- and data-scalable, and provide actionable guidance for scaling MIM pre-training and for efficient model selection without excessive downstream evaluation.
Abstract
An important goal of self-supervised learning is to enable model pre-training to benefit from almost unlimited data. However, one method that has recently become popular, namely masked image modeling (MIM), is suspected to be unable to benefit from larger data. In this work, we break this misconception through extensive experiments, with data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations. Our study reveals that: (i) Masked image modeling also demands larger data. We observe that very large models overfit when trained on relatively small data; (ii) The length of training matters. Large models trained with masked image modeling can benefit from more data when trained for longer; (iii) The validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks. This observation allows us to evaluate pre-trained models in advance, without costly trial-and-error assessment on downstream tasks. We hope that our findings will advance the understanding of masked image modeling in terms of scaling ability.
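
Finding (iii) suggests a simple workflow: rank candidate pre-trained checkpoints by their pre-training validation loss, and check how strongly that loss correlates with fine-tuning accuracy on the few checkpoints that have actually been fine-tuned. The following is a minimal sketch of that idea, not the authors' evaluation code. The function names `pearson` and `rank_by_val_loss` and the checkpoint names in the usage note are hypothetical, and no numbers from the paper are reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_by_val_loss(val_losses):
    """Return checkpoint names sorted from lowest to highest validation loss.

    `val_losses` maps a (hypothetical) checkpoint name to its pre-training
    validation loss; a lower loss is assumed to indicate a better checkpoint.
    """
    return sorted(val_losses, key=val_losses.get)
```

For instance, `rank_by_val_loss({"ckpt_a": 0.5, "ckpt_b": 0.3})` places `ckpt_b` first. A strongly negative value from `pearson(val_losses, finetune_accuracies)` would be consistent with the claim that lower validation loss goes with better fine-tuning accuracy.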
