Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation

Yangxiao Lu; Jishnu Jaykumar P; Yunhui Guo; Nicholas Ruozzi; Yu Xiang

TL;DR

To tackle novel instance detection and segmentation (NIDS), the paper proposes NIDS-Net, which leverages Grounded-SAM to generate high-quality proposals and introduces a Weight Adapter to refine embeddings learned from a frozen DINOv2 backbone. Embeddings are formed via Foreground Feature Averaging on patch features, yielding $E_T \in \mathbb{R}^{N \times K \times C}$ for templates and $E_P \in \mathbb{R}^{Q \times C}$ for proposals, with refinement guided by an InfoNCE objective. Matching relies on cosine similarity in the adapted space, optionally augmented by an appearance score and resolved by stable matching to assign unique instance IDs, producing precise detections and segmentations. The approach achieves substantial performance gains across four detection datasets and seven BOP segmentation datasets and demonstrates real-world applicability on robotic platforms, all while avoiding end-to-end retraining of large backbones. Overall, the work shows how to effectively repurpose pre-trained vision models for NIDS via a lightweight, generalizable adapter that enhances embedding discriminability without overfitting.
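
As a rough illustration of the embedding and matching steps summarized above, the following NumPy sketch averages patch features over a foreground mask and scores proposals against templates by cosine similarity. The function names, the patch-grid layout, and the choice to take the maximum over the $K$ templates of each instance are assumptions made for this example; they are not taken from the paper's code, and the adapter, appearance score, and stable matching are omitted.

```python
import numpy as np


def foreground_feature_average(patch_feats, fg_mask):
    """Average patch features over foreground patches.

    patch_feats: (H, W, C) array of patch embeddings (e.g. from a frozen ViT).
    fg_mask:     (H, W) boolean array, True where the patch is foreground.
    Returns a single (C,) embedding.
    """
    fg = patch_feats[fg_mask]  # (M, C) foreground patches only
    if fg.shape[0] == 0:
        raise ValueError("mask contains no foreground patches")
    return fg.mean(axis=0)


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def match_scores(E_T, E_P):
    """Cosine-similarity scores between proposals and instances.

    E_T: (N, K, C) template embeddings, K templates for each of N instances.
    E_P: (Q, C) proposal embeddings.
    Returns (Q, N) scores. Taking the max over the K templates is an
    assumption of this sketch, not necessarily the paper's aggregation.
    """
    T = l2_normalize(E_T)
    P = l2_normalize(E_P)
    sims = np.einsum("qc,nkc->qnk", P, T)  # (Q, N, K) cosine similarities
    return sims.max(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E_T = rng.normal(size=(3, 4, 8))   # 3 instances, 4 templates, 8 channels
    E_P = E_T[[2, 0], 1, :]            # proposals copied from known templates
    labels = match_scores(E_T, E_P).argmax(axis=1)
    print(labels)                      # predicted instance index per proposal
```

Each proposal would then be assigned the instance with the highest score, subject to whatever thresholding or one-to-one assignment the full pipeline applies.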

Abstract

Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances given a few examples of each instance. We propose a unified, simple, yet effective framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advancements in large vision models, we utilize Grounding DINO and Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We utilize foreground feature averages of patch embeddings from the DINOv2 ViT backbone, followed by refinement through a weight adapter mechanism that we introduce. We show experimentally that our weight adapter can adjust the embeddings locally within their feature space and effectively limit overfitting in the few-shot setting. Furthermore, the weight adapter optimizes weights to enhance the distinctiveness of instance embeddings during similarity computation. This methodology enables a straightforward matching strategy that results in significant performance gains. Our framework surpasses current state-of-the-art methods, demonstrating notable improvements in four detection datasets. In the segmentation tasks on seven core datasets of the BOP challenge, our method outperforms the leading published RGB methods and remains competitive with the best RGB-D method. We have also verified our method using real-world images from a Fetch robot and a RealSense camera. Project Page: https://irvlutd.github.io/NIDSNet/
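
To make the idea of an adapter that reweights embeddings more concrete, here is a minimal NumPy sketch. It assumes a hypothetical form in which a small two-layer MLP with a sigmoid output produces per-channel weights that multiply the input embedding, and it pairs this with a multi-positive InfoNCE-style loss over instance labels. The layer sizes, the temperature value, and the exact loss variant are assumptions of this example and may differ from what the paper uses.

```python
import numpy as np


def weight_adapter(e, W1, W2):
    """Hypothetical weight adapter: rescale each channel of e.

    e:  (B, C) input embeddings.
    W1: (C, H) and W2: (H, C) are the adapter's only learnable parameters.
    The sigmoid keeps every weight in (0, 1), so the output stays a
    channel-wise rescaling of the input rather than a new embedding.
    """
    h = np.maximum(e @ W1, 0.0)              # ReLU hidden layer
    w = 1.0 / (1.0 + np.exp(-(h @ W2)))      # per-channel weights in (0, 1)
    return w * e


def info_nce(emb, labels, temperature=0.05):
    """Multi-positive InfoNCE-style loss (an assumed variant).

    emb:    (B, C) embeddings; labels: (B,) integer instance IDs.
    For each anchor, embeddings with the same label are positives and
    all other embeddings in the batch are negatives.
    """
    labels = np.asarray(labels)
    z = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)        # exclude self-similarity
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos, False)
    has_pos = pos.any(axis=1)
    if not has_pos.any():
        raise ValueError("batch contains no positive pairs")
    pos_sum = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_sum[has_pos] / pos.sum(axis=1)[has_pos]
    return per_anchor.mean()
```

In a training loop, only `W1` and `W2` would be updated to reduce this loss on the adapted template embeddings, while the backbone that produced `e` stays frozen.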

Paper Structure (21 sections, 5 equations, 15 figures, 14 tables)

Introduction
Related Work
Method
Instance Embedding Generation Stage
Object Proposal Stage
Embedding Refinement via an Adapter
Matching Stage
Experiments
Detection Datasets
Segmentation Datasets
Benchmarking Results
Real-world Testing and Failure Cases
Ablation Study
Discussions
Training Details
...and 6 more sections

Figures (15)

Figure 1: We leverage pre-trained vision models for object proposal generation and feature extraction, and introduce a weight adapter to improve pre-trained feature embeddings for novel object instance detection and segmentation.
Figure 2: In our framework NIDS-Net, only adapters are learnable, while other models are frozen. Instance IDs are the instance labels.
Figure 3: (Left) CLIP-Adapter (Gao et al., 2021). (Right) Our introduced weight adapter.
Figure 4: Visual results on the RoboTools benchmark.
Figure 5: Comparison of segmentation results using CNOS, SAM6D, and NIDS-Net on the YCB-V dataset. CNOS and SAM6D may misclassify some background regions or object parts as objects because they rely on SAM for proposal generation. Red arrows indicate these mistakes. NIDS-Net addresses this limitation by using Grounded-SAM.
...and 10 more figures
