ProMerge: Prompt and Merge for Unsupervised Instance Segmentation

Dylan Li, Gyungin Shin

Abstract

Unsupervised instance segmentation aims to segment distinct object instances in an image without relying on human-labeled data. This field has recently seen significant advancements, partly due to the strong local correspondences afforded by rich visual feature representations from self-supervised models (e.g., DINO). Recent state-of-the-art approaches use self-supervised features to represent images as graphs and solve a generalized eigenvalue system (i.e., normalized-cut) to generate foreground masks. While effective, this strategy is limited by its attendant computational demands, leading to slow inference speeds. In this paper, we propose Prompt and Merge (ProMerge), which leverages self-supervised visual features to obtain initial groupings of patches and applies a strategic merging to these segments, aided by a sophisticated background-based mask pruning technique. ProMerge not only yields competitive results but also offers a significant reduction in inference time compared to state-of-the-art normalized-cut-based approaches. Furthermore, when training an object detector using our mask predictions as pseudo-labels, the resulting detector surpasses the current leading unsupervised model on various challenging instance segmentation benchmarks.

Paper Structure (22 sections, 2 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Qualitative examples of ProMerge, a simple yet effective training-free approach for unsupervised instance segmentation. Despite its simplicity, ProMerge demonstrates strong segmentation performance.
  • Figure 2: An overview of ProMerge. Given an input image, we obtain initial mask proposals by prompting visual features from an image encoder using a 2D point grid. Then, the noisy proposals are filtered through the proposed background-based mask pruning. The final predictions are made by iteratively merging the remaining foreground masks.
  • Figure 2: Speed comparison. Our method runs at approximately 3.6 times the FPS of MaskCut.
  • Figure 3: Qualitative examples of the pixel-wise voting. For each case, an input image, background candidate masks (only three are shown for clarity), and the voted mask are visualized. The voted background mask, $\tilde{\mathbf{M}}^{\text{bg}}$, effectively filters out the background, leaving only foreground regions despite the noisy candidate masks.
  • Figure 4: An illustration of the proposed cascade mask filtering process. At each iteration, we evaluate the newly proposed mask on only those pixels not yet covered by the cumulative foreground mask, which aggregates the pixels of mask proposals from preceding iterations. If these previously unseen pixels overlap significantly with the background mask, as quantified by the Intersection-over-Area (IoA) metric, the mask proposal for that iteration is discarded. An example can be observed in the fourth iteration (rightmost), in which the mask proposal is eliminated due to its high IoA with the background mask. Note that the feature similarity condition is omitted from the figure for visual clarity. See the text for details.
  • ...and 5 more figures
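
The pixel-wise voting of Figure 3 and the cascade filtering of Figure 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are represented as sets of (row, column) pixel coordinates, and the function names (`vote_background`, `ioa`, `cascade_filter`), the vote count `min_votes`, the threshold `ioa_threshold=0.5`, and the choice to grow the cumulative foreground only from accepted proposals are all hypothetical choices that the captions do not specify. As in Figure 4, the feature similarity condition is omitted.

```python
from collections import Counter


def vote_background(candidates, min_votes):
    """Pixel-wise voting: keep pixels marked by at least `min_votes` candidates.

    `min_votes` is a hypothetical parameter; the captions do not state the rule.
    """
    votes = Counter(pixel for mask in candidates for pixel in mask)
    return {pixel for pixel, count in votes.items() if count >= min_votes}


def ioa(region, reference):
    """Intersection-over-Area: fraction of `region` that lies inside `reference`."""
    if not region:
        return 0.0  # assumption: an empty region has no overlap
    return len(region & reference) / len(region)


def cascade_filter(proposals, background, ioa_threshold=0.5):
    """Drop proposals whose not-yet-covered pixels fall mostly on the background.

    `ioa_threshold` is a hypothetical value chosen for illustration.
    """
    covered = set()  # cumulative foreground mask from earlier iterations
    kept = []
    for mask in proposals:
        unseen = mask - covered  # pixels this proposal newly contributes
        if ioa(unseen, background) > ioa_threshold:
            continue  # high overlap with background: discard this proposal
        kept.append(mask)
        covered |= mask  # assumption: only accepted masks extend the foreground
    return kept


# Toy 3x4 example: the top row is background according to most candidates.
background_candidates = [
    {(0, 0), (0, 1), (0, 2), (0, 3)},
    {(0, 0), (0, 1), (0, 2), (1, 3)},
    {(0, 0), (0, 1), (1, 0), (0, 3)},
]
background = vote_background(background_candidates, min_votes=2)

proposals = [
    {(1, 0), (1, 1), (2, 0), (2, 1)},
    {(1, 1), (1, 2), (2, 1), (2, 2)},
    {(2, 2), (0, 2), (0, 3)},  # new pixels lie entirely on the background
]
kept = cascade_filter(proposals, background)
print(len(kept), "of", len(proposals), "proposals kept")
```

In this toy example, the third proposal contributes only pixels that lie inside the voted background, so its IoA is 1.0 and it is dropped. A practical implementation would more likely operate on boolean arrays at patch or pixel resolution rather than on coordinate sets.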