Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction
Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny
TL;DR
LoMaR tackles the computational bottleneck of global masked reconstruction in generative vision SSL by restricting self-attention to local $K\times K$ windows (notably $7\times7$) and using a lightweight encoder with an MLP head. This local masked reconstruction yields substantial efficiency gains, up to 3.1$\times$ faster than MAE when pretraining on 448$\times$448 images, while achieving comparable or better accuracy on ImageNet-1K and improvements on downstream tasks such as COCO object detection and ADE20K semantic segmentation. The method generalizes beyond MAE: integrating LoMaR into BEiT also improves accuracy and reduces training time. Overall, LoMaR offers a scalable, efficient paradigm for self-supervised visual pretraining, with potential extensions to video and large-scale datasets.
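To make the efficiency argument concrete, the sketch below counts query-key pairs in a single self-attention layer for global versus windowed attention. It assumes a patch size of 16 pixels, and the helpers `grid_tokens` and `attention_pairs` are hypothetical names introduced only for this illustration. The count ignores everything else that affects wall-clock time (how many windows are sampled per image, which tokens the encoder actually sees, MLP layers), so it is not meant to reproduce the reported speedups.

```python
def grid_tokens(image_size: int, patch_size: int = 16) -> int:
    """Tokens in a square image split into non-overlapping square patches.

    patch_size=16 is an assumption for illustration, not a value from the text.
    """
    side = image_size // patch_size
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs in one self-attention layer (quadratic in token count)."""
    return num_tokens * num_tokens


WINDOW = 7  # K = 7, the window side length in patches

for image_size in (224, 448):
    n_global = grid_tokens(image_size)
    n_local = WINDOW * WINDOW
    print(
        f"{image_size}x{image_size}: "
        f"global tokens={n_global}, pairs={attention_pairs(n_global)}; "
        f"one {WINDOW}x{WINDOW} window tokens={n_local}, pairs={attention_pairs(n_local)}"
    )
```

Under these assumptions the per-window cost stays fixed as resolution grows, while the global cost grows with the square of the token count, which is consistent with the larger gains claimed for high-resolution pretraining.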
Abstract
Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these methods, generative self-supervised approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7$\times$7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning on 384$\times$384 images, the pretrained LoMaR reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. LoMaR is especially efficient when pretraining on high-resolution images: for example, it is 3.1$\times$ faster than MAE, with 0.2% higher classification accuracy, when pretraining on 448$\times$448 images. This local masked reconstruction mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at https://github.com/junchen14/LoMaR.
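The core idea of sampling a 7$\times$7 window of patches and masking part of it can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `sample_local_window`, the uniform window placement, and the `mask_ratio=0.8` default are assumptions made here, and the encoder and reconstruction head are omitted.

```python
import random


def sample_local_window(grid_size, window=7, mask_ratio=0.8, seed=None):
    """Pick one window x window block of patches and split it into visible/masked.

    grid_size  : patches per side of the full image grid (e.g. 14 for 224/16)
    mask_ratio : fraction of window patches to mask (assumed value, not from the text)
    Returns two lists of flat patch indices into the full grid.
    """
    if window > grid_size:
        raise ValueError("window must fit inside the patch grid")
    rng = random.Random(seed)

    # Top-left corner of the window, chosen uniformly among valid positions.
    top = rng.randrange(grid_size - window + 1)
    left = rng.randrange(grid_size - window + 1)

    # Flat indices of every patch inside the window, in row-major order.
    patch_ids = [
        r * grid_size + c
        for r in range(top, top + window)
        for c in range(left, left + window)
    ]

    # Randomly choose which patches in the window are masked.
    num_masked = round(mask_ratio * len(patch_ids))
    masked_set = set(rng.sample(patch_ids, num_masked))
    visible = [p for p in patch_ids if p not in masked_set]
    masked = [p for p in patch_ids if p in masked_set]
    return visible, masked


visible, masked = sample_local_window(grid_size=14, seed=0)
print(len(visible), len(masked))  # 10 39
```

In a full pipeline, only the patches inside the sampled window would be fed to the encoder, with the masked ones being the reconstruction targets; attention therefore never spans more than 49 tokens regardless of image resolution.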
