R-MAE: Regions Meet Masked Autoencoders

Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen

TL;DR

This work introduces masked Region Autoencoding (RAE) and its integration with Masked Autoencoding (MAE), resulting in R-MAE, a region-aware pretraining framework. By representing regions as binary region maps and treating regions as queries, R-MAE addresses the one-to-many mapping challenge and preserves permutation equivariance while leveraging pixel features to help region reconstruction. Across COCO, LVIS, and ADE20K, R-MAE improves downstream object detection, instance segmentation, and semantic segmentation over MAE, with negligible computational overhead, and benefits from high-quality region sources such as SAM. Additionally, R-MAE enables interactive segmentation demonstrations, suggesting a pathway toward promptable region-level reasoning in vision, and shows robustness across region sources and data scales. These results underscore the value of incorporating region-level structure into self-supervised learning for dense vision tasks.

Abstract

In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at https://github.com/facebookresearch/r-mae.

Paper Structure (45 sections, 9 equations, 8 figures, 10 tables)


Figures (8)

  • Figure 1: Region-Aware Masked Autoencoder (R-MAE). The masked region autoencoding as a standalone task learns to reconstruct multiple region maps in parallel given visible region and image patches. The region encoder generates region embeddings by pooling features from visible region patches. The region decoder then takes region embeddings and decodes them into region maps using image features from the pixel encoder. By treating regions as queries, it effectively balances speed and accuracy. The design of our architecture allows its integration with pixel reconstruction in MAE (de-highlighted).
  • Figure 2: The region query is spatially expanded in the length variant. We modify the standard cross-attention layer of DETR (Carion et al., 2020) (left). Given a region query, it is summed with all value vectors to expand its spatial axes (right). A small MLP head is attached afterwards. This design enables efficient reconstruction of region maps from the region queries.
  • Figure 3: Attention maps from a Vision Transformer pre-trained with R-MAE. In each group, from left to right, we show the original image with the selected query (marked by a red square), followed by three attention maps for that query generated by i) MoCo v3; ii) MAE; and iii) R-MAE. All methods are pre-trained on COCO train2017. From top to bottom, the rows show three types of query: i) rigid objects, ii) non-rigid objects, iii) multiple objects. Darker red in an attention map denotes larger attention weights. Compared to the baselines, the attention map from R-MAE is more instance-aware.
  • Figure 4: Masking strategy in R-MAE. Mask ratio matters: we either change the region mask ratio ($\beta_\text{R}$) alone (top), or change it jointly with the image mask ratio ($\beta_\text{R}{=}\beta_\text{I}$, bottom). In both cases, a high mask ratio (${\sim}0.75$) is required.
  • Figure 5: Qualitative results on COCO val2017 images, using R-MAE pre-trained with unsupervised region maps from the Felzenszwalb-Huttenlocher (FH) algorithm (Felzenszwalb and Huttenlocher, 2004), and then applied to either COCO ground-truth regions (left column) or the FH regions used during pre-training (right column). The image group contains 1) the masked image, 2) the image reconstruction, 3) the original image. The region group has 1) the masked region, 2) the region reconstruction, 3) the original region, 4) regions in the corresponding image. Beyond the results themselves, the figure also gives a sense of the differences between ground-truth regions and the regions used in R-MAE. Surprisingly, the model pre-trained with FH regions generalizes well to ground-truth ones.
  • ...and 3 more figures
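
The query expansion described in the Figure 2 caption (a region query summed with all value vectors, followed by a small MLP head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the two-layer ReLU MLP, the sigmoid output, the random weights, and all sizes (`K`, `N`, `D`, `HIDDEN`, `P`, `GRID`) are assumptions chosen only to make the shapes concrete, and the linear projections of a real cross-attention layer are omitted.

```python
import numpy as np

def expand_region_queries(queries, values):
    """Sum every region query with every value vector.

    queries: (K, D) region embeddings, one per region.
    values:  (N, D) image features, one per patch (assumed already projected).
    Returns an array of shape (K, N, D): each query gains a spatial axis.
    """
    return queries[:, None, :] + values[None, :, :]

def mlp_head(x, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping each expanded vector to one patch
    of a region map; the sigmoid keeps outputs in [0, 1]."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

def unpatchify(patches, grid, patch):
    """(K, grid*grid, patch*patch) -> (K, grid*patch, grid*patch)."""
    k = patches.shape[0]
    x = patches.reshape(k, grid, grid, patch, patch)
    x = x.transpose(0, 1, 3, 2, 4)  # (K, grid, patch, grid, patch)
    return x.reshape(k, grid * patch, grid * patch)

# Illustrative sizes only: 3 regions, a 4x4 grid of 4x4-pixel patches.
K, GRID, P, D, HIDDEN = 3, 4, 4, 8, 32
N = GRID * GRID

rng = np.random.default_rng(0)
queries = rng.normal(size=(K, D))
values = rng.normal(size=(N, D))
w1 = rng.normal(size=(D, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
w2 = rng.normal(size=(HIDDEN, P * P)) * 0.1
b2 = np.zeros(P * P)

expanded = expand_region_queries(queries, values)
maps = unpatchify(mlp_head(expanded, w1, b1, w2, b2), GRID, P)
print(maps.shape)  # (3, 16, 16): one soft region map per query
```

Because each query is processed independently and no ordering of regions is assumed, permuting the rows of `queries` permutes the predicted maps in the same way, which matches the permutation equivariance mentioned in the TL;DR. The cost of the expansion is one broadcast addition per region, in line with the caption's point that region maps are reconstructed efficiently from the queries.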