Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny

TL;DR

LoMaR tackles the computational bottleneck of global masked reconstruction in generative self-supervised vision learning by restricting self-attention to local $K\times K$ windows (notably $7\times7$) and using a lightweight encoder with an MLP head. Local masked reconstruction yields substantial efficiency gains, e.g., pretraining that is 3.1$\times$ faster than MAE on 448$\times$448 images, while achieving comparable or better accuracy on ImageNet-1K and improvements on downstream tasks such as COCO object detection and ADE20K semantic segmentation. The method generalizes beyond MAE, as demonstrated by integrating LoMaR into BEiT with notable accuracy and training-time reductions. Overall, LoMaR offers a scalable, efficient paradigm for self-supervised visual pretraining, with potential extensions to video and large-scale datasets.

Abstract

Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these approaches, generative self-supervised methods such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7$\times$7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning on 384$\times$384 images, the pretrained LoMaR reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images: on 448$\times$448 images, it is 3.1$\times$ faster than MAE with 0.2% higher classification accuracy. This local masked reconstruction mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at https://github.com/junchen14/LoMaR.
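The window-level masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The 14-by-14 patch grid (a 224$\times$224 image with 16$\times$16 patches), the number of windows per image, the rounding rule, and the function name `sample_local_masks` are all assumptions made for illustration. The 7$\times$7 window comes from the abstract, and the 80% mask ratio is the example value quoted in the figure captions.

```python
import random


def sample_local_masks(grid=14, window=7, num_windows=4, mask_ratio=0.8, seed=0):
    """Sample square windows on a grid x grid patch layout and mask a random
    subset of the patches inside each window.

    All defaults except the 7x7 window and the 0.8 ratio are hypothetical.
    """
    rng = random.Random(seed)
    tokens_per_window = window * window
    # Assumed rounding rule: nearest integer number of masked patches.
    num_masked = int(round(mask_ratio * tokens_per_window))

    views = []
    for _ in range(num_windows):
        # Top-left corner chosen so the window stays inside the grid.
        top = rng.randint(0, grid - window)
        left = rng.randint(0, grid - window)
        # Flat (row-major) indices of the patches covered by this window.
        patch_ids = [
            (top + r) * grid + (left + c)
            for r in range(window)
            for c in range(window)
        ]
        masked = set(rng.sample(patch_ids, num_masked))
        visible = [p for p in patch_ids if p not in masked]
        views.append(
            {"top_left": (top, left), "masked": sorted(masked), "visible": visible}
        )
    return views


if __name__ == "__main__":
    for view in sample_local_masks():
        print(view["top_left"], len(view["masked"]), "masked,",
              len(view["visible"]), "visible")
```

With these assumed settings each window holds 49 patches, of which 39 are masked and 10 stay visible. In a model following this scheme, reconstruction of the masked patches would use only the visible patches from the same window.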

Paper Structure (20 sections, 9 figures, 7 tables)

This paper contains 20 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: We visualize the attention patterns employed by $\text{MAE}_{\text{Large}}$ in the reconstruction of a random target patch, indicated in orange. Patches that are important for prediction are usually close to the target patch. We selected the images randomly from the ImageNet-1K validation set.
  • Figure 2: Contrast between the masking and reconstruction strategies of MAE and LoMaR. During pretraining, MAE randomly masks 75% of the patches and reconstructs them by attending to the remaining visible patches. LoMaR randomly samples several small regions and masks a random subset of the patches in each region, e.g., 80%. The masked patches attend only to the visible patches inside their own region for reconstruction. In contrast to MAE, LoMaR usually samples fewer visible patches per image.
  • Figure 3: Computational efficiency evaluation: ImageNet-1K top-1 accuracy per pretraining time for each compared method on low-resolution 224$\times$224 images.
  • Figure 4: Comparison between the simple LoMaR encoder and the asymmetric MAE encoder-decoder architecture under our random window masking strategy. The window sizes vary from 14$\times$14 to 5$\times$5.
  • Figure 5: Example results on ImageNet (upper two rows) and COCO (lower two rows) validation images. We mask out 80% of the patches and reconstruct them with our pretrained model. Each reconstruction figure has four parts, from left to right: 1) the original image; 2) the sampled window (7$\times$7 patches); 3) the masked image; 4) our reconstructed image.
  • ...and 4 more figures
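
To give a rough sense of why the window sizes listed for Figure 4 matter, the sketch below counts tokens and query-key pairs for a single window in which every patch attends to every other patch. It is back-of-the-envelope arithmetic only. It ignores masking, the number of windows per image, and every non-attention layer, so it does not reproduce the measured speedups reported in the paper. The helper name `attention_pairs` is hypothetical.

```python
def attention_pairs(window):
    """Return (tokens, query-key pairs) for full self-attention in one
    window x window block of patches."""
    tokens = window * window
    return tokens, tokens * tokens


if __name__ == "__main__":
    # Largest, default, and smallest window sizes named in the captions.
    for k in (14, 7, 5):
        tokens, pairs = attention_pairs(k)
        print(f"{k}x{k} window: {tokens} tokens, {pairs} query-key pairs")
```

Because the pair count grows with the fourth power of the window side length, halving the side from 14 to 7 cuts the per-window pair count by a factor of 16.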