R-MAE: Regions Meet Masked Autoencoders
Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen
TL;DR
This work introduces Masked Region Autoencoding (RAE) and its integration with Masked Autoencoding (MAE), resulting in R-MAE, a region-aware pre-training framework. By representing regions as binary region maps and treating regions as queries, R-MAE addresses the one-to-many mapping between an image and its regions and preserves permutation equivariance, while leveraging pixel features to help region reconstruction. Across COCO, LVIS, and ADE20K, R-MAE improves downstream object detection, instance segmentation, and semantic segmentation over MAE with negligible computational overhead, and it benefits from high-quality region sources such as SAM. R-MAE also enables interactive segmentation demonstrations, suggesting a pathway toward promptable region-level reasoning in vision, and it is robust across region sources and data scales. These results underscore the value of incorporating region-level structure into self-supervised learning for dense vision tasks.
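To make the idea of binary region maps concrete, the following is a minimal NumPy sketch of how inputs for masked region autoencoding might be prepared. It is an illustration under stated assumptions, not the authors' implementation: the function names (`patchify`, `build_inputs`), the patch size, the 75% mask ratio, and the choice to share one random mask between the image and all region maps are hypothetical choices made for this example.

```python
import numpy as np


def patchify(x, patch):
    """Split an (H, W) or (H, W, C) array into flattened, non-overlapping patches."""
    if x.ndim == 2:
        x = x[..., None]
    h, w, c = x.shape
    gh, gw = h // patch, w // patch
    x = x[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def build_inputs(image, region_maps, patch=4, mask_ratio=0.75, seed=0):
    """Prepare visible patches and reconstruction targets (illustrative only).

    image:       (H, W, C) array of pixels.
    region_maps: (K, H, W) array, one binary map per region.
    Assumption: the same random mask is applied to the image and every region map.
    """
    rng = np.random.default_rng(seed)
    image_patches = patchify(image, patch)                       # (N, patch*patch*C)
    region_patches = np.stack([patchify(r, patch) for r in region_maps])  # (K, N, patch*patch)

    num_patches = image_patches.shape[0]
    num_keep = int(round(num_patches * (1 - mask_ratio)))
    keep = np.sort(rng.permutation(num_patches)[:num_keep])     # indices of visible patches

    return {
        "visible_image": image_patches[keep],         # input to a pixel encoder
        "visible_regions": region_patches[:, keep],   # input to a region encoder, one row per region
        "region_targets": region_patches,             # full binary maps to reconstruct
        "keep": keep,
    }


# Toy example: one image paired with two regions (top half and bottom half).
image = np.zeros((16, 16, 3))
regions = np.zeros((2, 16, 16))
regions[0, :8] = 1.0
regions[1, 8:] = 1.0
batch = build_inputs(image, regions)
```

Each region keeps its own row in `visible_regions`, so one image maps to many region inputs, and reordering the region maps only reorders those rows. This is the property the text refers to as permutation equivariance.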
Abstract
In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels, or regions. Specifically, we design an architecture that efficiently addresses the one-to-many mapping between images and regions and is highly effective, especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overhead. Beyond the quantitative evaluation, our analysis indicates that models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at https://github.com/facebookresearch/r-mae.
