Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

TL;DR

This work addresses the high data and compute demands of self-supervised visual pre-training, especially for masked image modeling, by proposing CropMAE, a two-crop Siamese pre-training approach that uses image crops instead of video frames. CropMAE trains a shared-weight ViT encoder on both crops, masks the second crop at a very high ratio, and uses a transformer decoder to reconstruct that masked crop, which lets it learn object-centric representations quickly and without motion. Empirically, CropMAE achieves performance competitive with or superior to state-of-the-art methods on three video-propagation downstream tasks after 400 pre-training epochs, and it learns faster when trained on image collections, reducing dependence on large video datasets. The model attends strongly to object boundaries and trains efficiently, which makes image-based, crop-driven pre-training a practical option for scalable self-supervised representation learning.

Abstract

Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked Autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method differs by exclusively considering pairs of crops taken from the same image but cropped differently, rather than the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performance and drastically reducing pre-training and learning time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn such representations from explicit object motion, but rather thanks to the implicit image transformations that occur between the two views. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.
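
The link between the 98.5% masking ratio and "two visible patches" can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the CropMAE repository: it assumes a 224x224 input and 16x16 patches (the abstract does not state the resolution), and it assumes the number of kept patches is rounded down. The function names are hypothetical.

```python
import random
from typing import List


def num_visible_patches(image_size: int, patch_size: int, mask_ratio: float) -> int:
    """Count the patches left visible after masking (rounded down, an assumption)."""
    num_patches = (image_size // patch_size) ** 2
    return int(num_patches * (1.0 - mask_ratio))


def sample_visible_indices(num_patches: int, num_visible: int, rng: random.Random) -> List[int]:
    """Pick which patch indices stay visible, uniformly at random without replacement."""
    return sorted(rng.sample(range(num_patches), num_visible))


# Assumed setting: 224x224 image, 16x16 patches -> 14 * 14 = 196 patches.
print(num_visible_patches(224, 16, 0.985))  # 2 visible patches at 98.5% masking
print(num_visible_patches(224, 16, 0.95))   # 9 visible patches at 95% masking

rng = random.Random(0)
print(sample_visible_indices(196, num_visible_patches(224, 16, 0.985), rng))
```

Under these assumptions, 196 patches at a 98.5% masking ratio leave 2.94 patches, which rounds down to the two visible patches mentioned in the abstract.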

Paper Structure (32 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: CropMAE self-supervised pre-training. Given an input image ($V_1$), a second image is generated by performing a random crop and, optionally, a horizontal flip on the original image ($V_2$). We then patchify [Dosovitskiy2021ViT] both views and mask [Kenton2019Bert, He2022Masked] an extremely high portion of the second image (above 98.5%). Both views are encoded by a Siamese [Bromley1993Siamese] ViT encoder, with added positional embedding [Dosovitskiy2021ViT]. A transformer [Girdhar2017Attention] decoder reconstructs the masked image $R$ using self-attention layers on the tokens of the masked image and cross-attention layers between the tokens of the masked and unmasked images.
  • Figure 2: Illustration of our four cropping strategies. For a given input image $I$, we generate an unmasked view $V_1$ and a masked view $V_2$ following one of four different cropping strategies: (a) Same Views, where $V_1 = V_2$; (b) Random Views, where $V_1$ and $V_2$ are two independent random crops; (c) Local-to-Global, where $V_1$ is a random crop within $V_2$, and (d) Global-to-Local, where $V_2$ is a random crop within $V_1$.
  • Figure 3: Reconstructions of CropMAE. We train CropMAE with a ViT-S/16 without normalizing pixel values and a masking ratio of 98.5%. We visualize the reconstructions of some images from ImageNet. The images are displayed in the following order from top to bottom: Input Image ($V_1$), Random Resized Crop ($V_2$), Masked Image ($M$), and Reconstruction ($R$).
  • Figure 4: Qualitative results. We train CropMAE with a ViT-S/16 and qualitatively validate our results on three propagation downstream tasks: video object segmentation (DAVIS-2017 [PontTuset2017Davis-arxiv]), semantic part propagation [Zhou2018Adaptive], and human pose propagation (JHMDB [Jhuang2013Towards]).
  • Figure 5: Self-attention maps from CropMAE with a ViT-S/8 trained on our ImageNet subset. We visualize the self-attention of the [CLS] token from a selected head in the last encoder layer of a ViT-S/8, which was trained on our ImageNet subset without using any supervision to learn this specific token. These self-attention maps reveal that our model can learn object boundaries without the need for prior motion information during pre-training.
  • ...and 2 more figures
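
The four cropping strategies described in Figure 2 can be sketched on crop boxes alone, without touching pixels. This is a minimal illustration, not the authors' implementation: the box representation, the `min_scale` parameter, the function names, and the choice to draw the "same views" crop at random are all assumptions. Resizing the crops to a fixed resolution and the optional horizontal flip are left out.

```python
import random
from typing import Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def random_crop(region: Box, rng: random.Random, min_scale: float = 0.3) -> Box:
    """Sample a random axis-aligned box that lies inside `region`."""
    left, top, right, bottom = region
    width, height = right - left, bottom - top
    crop_w = max(1, int(width * rng.uniform(min_scale, 1.0)))
    crop_h = max(1, int(height * rng.uniform(min_scale, 1.0)))
    x0 = rng.randint(left, right - crop_w)   # randint is inclusive on both ends
    y0 = rng.randint(top, bottom - crop_h)
    return (x0, y0, x0 + crop_w, y0 + crop_h)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def make_views(image: Box, strategy: str, rng: random.Random) -> Tuple[Box, Box]:
    """Return (V1, V2): the unmasked view and the view that will be masked."""
    if strategy == "same":              # (a) V1 = V2
        v1 = random_crop(image, rng)
        return v1, v1
    if strategy == "random":            # (b) two independent random crops
        return random_crop(image, rng), random_crop(image, rng)
    if strategy == "local_to_global":   # (c) V1 is a random crop within V2
        v2 = random_crop(image, rng)
        return random_crop(v2, rng), v2
    if strategy == "global_to_local":   # (d) V2 is a random crop within V1
        v1 = random_crop(image, rng)
        return v1, random_crop(v1, rng)
    raise ValueError(f"unknown strategy: {strategy}")


rng = random.Random(0)
image_box = (0, 0, 224, 224)  # assumed image size
for name in ("same", "random", "local_to_global", "global_to_local"):
    print(name, make_views(image_box, name, rng))
```

In the two nested strategies the second crop is drawn inside the first, so the containment relation stated in the caption holds by construction.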