From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

Qifan Zhang, Sai Haneesh Allu, Jikai Wang, Yangxiao Lu, Yu Xiang

TL;DR

This paper proposes L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals. Dense patch-level matching between the templates and the query image yields candidate points, which prompt an augmented Segment Anything Model equipped with instance-specific object tokens.

Abstract

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
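As a rough illustration of the local-matching step described in the abstract, the sketch below turns patch-level similarities into candidate point prompts. The function name `candidate_points`, the 14-pixel patch size, the cosine-similarity measure, and the fixed threshold are all assumptions made for illustration, not details taken from the paper. The plain threshold only stands in for the candidate selection module, and the SAM prompting stage is not shown.

```python
import numpy as np

def candidate_points(template_feats, query_feats, grid_w,
                     patch_size=14, sim_thresh=0.8):
    """Return pixel centers (x, y) of query patches that match a template patch.

    template_feats: (M, D) patch features pooled from the template images.
    query_feats:    (N, D) patch features of the query image, stored row-major
                    over a grid with `grid_w` patches per row.
    All parameter names and default values are hypothetical.
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    # L2-normalize so that the dot product equals cosine similarity.
    t = template_feats / (np.linalg.norm(template_feats, axis=1, keepdims=True) + eps)
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)

    sim = q @ t.T              # (N, M) similarity of every query/template patch pair
    best = sim.max(axis=1)     # best template match for each query patch

    # Simple threshold as a placeholder for a learned candidate selection step.
    keep = np.flatnonzero(best >= sim_thresh)

    # Convert flat patch indices to patch-center pixel coordinates.
    rows, cols = np.divmod(keep, grid_w)
    xs = (cols + 0.5) * patch_size
    ys = (rows + 0.5) * patch_size
    return np.stack([xs, ys], axis=1), best[keep]
```

The returned `(x, y)` points have the form expected by point-prompted segmentation models. In practice the features would come from a pretrained vision backbone, which this sketch leaves out.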

Paper Structure (23 sections, 16 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Conceptual comparison between object proposal-based instance detection methods and our local-to-global instance detection framework. Top: Proposal-based approaches [lu2024adapting] first generate object proposals in the query image and then perform instance matching to obtain detection results. Bottom: Our method starts from dense local correspondences to identify candidate points, then reconstructs complete instance masks from them, producing final detection results without explicit proposal generation.
  • Figure 2: Overview of our L2G-Det framework for novel instance detection. It consists of a candidate selection module and an augmented SAM module (SAM$^{*}$). Only the adapters and object tokens are learnable; all other components are frozen.
  • Figure 3: Comparison between the original SAM and our augmented SAM$^{*}$ when prompted with candidate points. With all parameters frozen, SAM tends to produce incomplete masks focused on local regions around the prompts. In contrast, SAM$^{*}$ incorporates a learnable instance-specific object token, which guides the decoder to produce more complete object masks and thereby improves detection performance.
  • Figure 4: Qualitative results on the RoboTools benchmark. From left to right: ground-truth annotations, results produced by NIDS-Net [lu2024adapting], and results of our method, L2G-Det.
  • Figure 5: Effect of the number of template images $K$.
  • ...and 4 more figures