Region-Based Representations Revisited
Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman T, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, Derek Hoiem
TL;DR
This work revisits region-based representations for recognition by pairing class-agnostic segmentation masks from SAM with strong unsupervised patch features such as DINOv2. By upsampling feature maps to image size and pooling within region masks, the authors construct compact region embeddings that support semantic segmentation, object-based image retrieval, multi-view segmentation, and activity classification with simple decoders. Across Pascal VOC, ADE20K, ScanNet, COCO, and Kinetics, region-based representations offer competitive performance and enable efficient aggregation over many images, highlighting their usefulness for customizable queries and multi-image inference. A key limitation is the current speed of SAM, but the approach shows clear potential to improve with faster mask generation and richer region/temporal features.
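The pooling step described above (upsample patch features to image size, then average within each region mask) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `upsample_nearest` and `pool_regions` are hypothetical, nearest-neighbor upsampling is an assumption (the interpolation method is not specified here), and the masks are assumed to arrive as a boolean array of shape (regions, height, width), for example from a segmenter such as SAM.

```python
import numpy as np


def upsample_nearest(patch_feats, out_h, out_w):
    """Nearest-neighbor upsampling of (Hp, Wp, D) patch features to (out_h, out_w, D).

    Assumption: nearest-neighbor is used for simplicity; other interpolation
    schemes could be substituted.
    """
    hp, wp, _ = patch_feats.shape
    rows = np.arange(out_h) * hp // out_h  # source patch row for each pixel row
    cols = np.arange(out_w) * wp // out_w  # source patch column for each pixel column
    return patch_feats[rows[:, None], cols[None, :]]


def pool_regions(patch_feats, masks):
    """Average-pool upsampled features inside each region mask.

    patch_feats: (Hp, Wp, D) float array of patch features (e.g. from DINOv2).
    masks:       (R, H, W) boolean array, one class-agnostic mask per region.
    Returns an (R, D) array with one embedding per region.
    """
    n_regions, h, w = masks.shape
    dense = upsample_nearest(patch_feats, h, w)            # (H, W, D)
    flat_feats = dense.reshape(h * w, -1)                  # (H*W, D)
    flat_masks = masks.reshape(n_regions, h * w).astype(flat_feats.dtype)
    sums = flat_masks @ flat_feats                         # (R, D) per-region feature sums
    areas = flat_masks.sum(axis=1, keepdims=True)          # (R, 1) pixel counts
    return sums / np.maximum(areas, 1.0)                   # empty masks yield zero vectors
```

Each image is reduced from H×W×D dense features to R×D region embeddings, which is what makes aggregation over many images inexpensive.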
Abstract
We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The compactness of the representation also makes it well suited to video analysis and other problems requiring inference across many images.
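To make the "linear decoder" idea concrete, the sketch below classifies precomputed region embeddings with a single linear layer and paints the predicted labels back onto the pixel grid. It is a hedged illustration, not the paper's decoder: the names `decode_regions` and `paint_labels` are hypothetical, the weights are assumed to have been trained elsewhere, and overlapping masks are resolved by letting later regions overwrite earlier ones, which is an assumption made here for simplicity.

```python
import numpy as np


def decode_regions(region_feats, weights, bias):
    """Linear decoder: (R, D) region embeddings -> (R,) predicted class indices.

    weights: (D, C) matrix and bias: (C,) vector, assumed trained elsewhere.
    """
    logits = region_feats @ weights + bias  # (R, C) class scores per region
    return logits.argmax(axis=1)


def paint_labels(masks, region_labels, background=-1):
    """Write each region's label into an (H, W) label map.

    masks: (R, H, W) boolean array. Pixels covered by no mask keep `background`.
    Assumption: where masks overlap, the later region overwrites the earlier one.
    """
    _, h, w = masks.shape
    label_map = np.full((h, w), background, dtype=np.int64)
    for mask, label in zip(masks, region_labels):
        label_map[mask] = label
    return label_map
```

Because the decoder operates on R region vectors rather than H×W pixels, swapping in a different set of weights for a new custom query only requires re-running this cheap step, not re-extracting masks or features.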
