Region-Based Representations Revisited

Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman T, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, Derek Hoiem

TL;DR

This work revisits region-based representations for recognition by pairing class-agnostic segmentation masks from SAM with strong unsupervised patch features such as DINOv2. By upsampling feature maps to image size and pooling within region masks, the authors construct compact region embeddings that support semantic segmentation, object-based image retrieval, multi-view segmentation, and activity classification with simple decoders. Across Pascal VOC, ADE20K, ScanNet, COCO, and Kinetics, region-based representations offer competitive performance and enable efficient aggregation over many images, highlighting their usefulness for customizable queries and multi-image inference. A key limitation is the current speed of SAM, but the approach shows clear potential to improve with faster mask generation and richer region/temporal features.
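The upsample-then-pool step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `upsample_nearest` and `pool_region_features` are hypothetical, nearest-neighbor upsampling is a stand-in for whatever interpolation the paper uses, and the masks and features are toy arrays rather than SAM or DINOv2 outputs.

```python
import numpy as np

def upsample_nearest(features, out_h, out_w):
    """Upsample an (h, w, d) patch-feature grid to (out_h, out_w, d).

    Nearest-neighbor is an assumption made for simplicity.
    """
    h, w, _ = features.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return features[rows[:, None], cols[None, :]]

def pool_region_features(features, masks):
    """Average per-pixel features (H, W, d) inside each boolean mask (n, H, W)."""
    masks = masks.astype(features.dtype)
    sums = np.einsum("nhw,hwd->nd", masks, features)  # per-region feature sums
    areas = masks.sum(axis=(1, 2))                    # pixels per region
    return sums / np.maximum(areas, 1)[:, None]       # guard against empty masks

# Toy input: a 2x2 patch grid with 1-dimensional features 0, 1, 2, 3.
patch_feats = np.arange(4, dtype=np.float64).reshape(2, 2, 1)
pixel_feats = upsample_nearest(patch_feats, 4, 4)

masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True  # region covering the top-left patch only
masks[1, :, :2] = True   # region covering the whole left half

region_feats = pool_region_features(pixel_feats, masks)  # shape (2, 1)
```

Each image is reduced to one embedding per region, so the number of tokens depends on the number of masks rather than on the patch grid.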

Abstract

We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The compactness of the representation also makes it well suited to video analysis and other problems requiring inference across many images.
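To make the "linear decoders" remark concrete, the sketch below applies a linear classifier to precomputed region embeddings and writes the per-region class scores back onto the pixel grid. It is only an illustration under assumptions: `decode_regions` and `paint_labels` are hypothetical names, the weights are a toy identity matrix rather than a trained decoder, and summing scores where masks overlap is one simple choice that the source does not specify.

```python
import numpy as np

def decode_regions(region_feats, weights, bias):
    """Linear decoder: (n, d) region embeddings -> (n, c) class scores."""
    return region_feats @ weights + bias

def paint_labels(masks, region_scores, background=-1):
    """Turn per-region scores into an (H, W) label map.

    Scores of all masks covering a pixel are summed (an assumption);
    pixels covered by no mask receive `background`.
    """
    masks_f = masks.astype(region_scores.dtype)
    pixel_scores = np.einsum("nhw,nc->hwc", masks_f, region_scores)
    labels = pixel_scores.argmax(axis=-1)
    covered = masks.any(axis=0)
    return np.where(covered, labels, background)

# Toy example: two regions, two classes, a 2x3 image whose last column is uncovered.
region_feats = np.array([[2.0, 0.0], [0.0, 3.0]])
weights = np.eye(2)   # stand-in for trained decoder weights
bias = np.zeros(2)

masks = np.zeros((2, 2, 3), dtype=bool)
masks[0, :, 0] = True
masks[1, :, 1] = True

label_map = paint_labels(masks, decode_regions(region_feats, weights, bias))
```

Because the decoder only sees a handful of region embeddings per image, swapping in a classifier for a new custom query does not require recomputing masks or features.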

Paper Structure (21 sections, 10 figures, 16 tables)

This paper contains 21 sections, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Our framework revisits the use of region features for downstream applications. We generate region features by first segmenting an image, extracting image features, then pooling the image features across the region masks.
  • Figure 2: Method overview. We generate masks using class-agnostic segmenters, such as SAM, and patch-based features using strong representations, such as DINOv2. The features are average-pooled in the masks, creating region-based representations, which can then be decoded with linear classifiers or decoders for a variety of tasks.
  • Figure 3: A comparison of region coverage when using SAM and SAM with SLIC. SLIC fills in many of the uncovered regions, leaving few holes.
  • Figure 4: Examples of object retrieval with region representations. The query object is highlighted in the first column. The second column contains the database images, and the third column shows the similarity score between all of the regions in the database image and the query object. Our method matches objects in database images to the query object under different settings.
  • Figure 5: Video Activity Classification Method Overview. By pooling regions across video frames, we can categorize a video using a small fraction of the number of tokens that would be required for patch-based representations.
  • ...and 5 more figures
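The retrieval setup in the Figure 4 caption, where every region of a database image is scored against a query region, can be sketched as follows. This is a hedged illustration: cosine similarity and ranking each image by its best-matching region are assumptions made here, and the names `cosine_scores` and `rank_images` and the toy embeddings are hypothetical.

```python
import numpy as np

def cosine_scores(query, regions, eps=1e-8):
    """Cosine similarity between a (d,) query and each row of (n, d) regions."""
    q = query / (np.linalg.norm(query) + eps)
    r = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + eps)
    return r @ q

def rank_images(query, database):
    """Rank images by the similarity of their best-matching region.

    `database` maps an image id to its (n_i, d) region embeddings.
    """
    scored = [
        (image_id, float(cosine_scores(query, regions).max()))
        for image_id, regions in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy database: image "a" contains a region identical to the query, image "b" does not.
query = np.array([1.0, 0.0])
database = {
    "a": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "b": np.array([[0.0, 1.0], [-1.0, 0.0]]),
}
ranking = rank_images(query, database)  # "a" ranks ahead of "b"
```

The per-region scores returned by `cosine_scores` correspond to the kind of similarity map shown in the figure's third column, here as one value per region rather than per pixel.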