ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models

Ying Zhang, Maoliang Yin, Wenfu Bi, Haibao Yan, Shaohan Bian, Cui-Hua Zhang, Changchun Hua

TL;DR

The paper tackles zero-shot unseen object instance segmentation for indoor service robots, addressing both the cost of data annotation and the sim-to-real gap. It introduces ZISVFM, a three-stage framework that combines the Segment Anything Model (SAM) with explicit representations from a self-supervised Vision Transformer (DINOv2) and uses K-Medoids-derived point prompts to refine masks without task-specific training. The approach processes colorized depth inputs to generate object proposals, filters non-object masks via attention-entropy weighting, and refines segmentation with cluster-center prompts, achieving strong results on OCID, OSD, and a new HIOD dataset that emphasizes hierarchical indoor scenes, including cabinets and drawers. Practical robotic validation on a Fetch platform demonstrates the method's applicability to grasping unknown objects in cluttered environments; future work focuses on domain adaptation and multi-view strategies to further bridge the sim-to-real gap.
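
As a rough illustration of what a colorized depth input could look like, the sketch below maps a depth map to RGB with a simple linear red-to-blue ramp. The function name `colorize_depth`, the ramp itself, and the treatment of zero depth as invalid are assumptions made for illustration; this summary does not specify the colorization that ZISVFM uses.

```python
def colorize_depth(depth, invalid=0.0):
    """Map a 2D depth map (list of rows) to RGB triples in 0..255.

    Assumption: the nearest valid pixel becomes red, the farthest blue,
    and pixels equal to `invalid` (missing depth) become black.
    """
    valid = [d for row in depth for d in row if d != invalid]
    if not valid:
        return [[(0, 0, 0) for _ in row] for row in depth]
    lo, hi = min(valid), max(valid)
    span = (hi - lo) or 1.0  # avoid division by zero for flat depth maps

    def to_rgb(d):
        if d == invalid:
            return (0, 0, 0)
        t = (d - lo) / span  # 0.0 = nearest, 1.0 = farthest
        return (round(255 * (1 - t)), 0, round(255 * t))

    return [[to_rgb(d) for d in row] for row in depth]
```

The resulting three-channel image can be passed to any model that expects RGB input, which is the property the proposal stage relies on.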

Abstract

Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the Segment Anything Model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the self-supervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at https://github.com/Yinmlmaoliang/zisvfm.
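
Stage (3) can be sketched in a few lines. The example below is a minimal, dependency-free K-Medoids (alternating assignment and update) that picks prompt points inside a binary mask. It assumes clustering is done on pixel coordinates with Euclidean distance; the authors' implementation may cluster a different representation, and the name `kmedoids_point_prompts` and its parameters are hypothetical.

```python
import math
import random

def kmedoids_point_prompts(mask, k=3, iters=20, seed=0):
    """Pick up to k prompt points (row, col) inside a binary mask via K-Medoids.

    Assumption: pixels are clustered by their coordinates. The update step
    is quadratic in cluster size, so this is only suitable for small masks.
    """
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return []
    k = min(k, len(pts))
    medoids = random.Random(seed).sample(pts, k)

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    for _ in range(iters):
        # Assignment step: attach every mask pixel to its nearest medoid.
        clusters = [[] for _ in medoids]
        for p in pts:
            j = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
            clusters[j].append(p)
        # Update step: the new medoid is the member with the smallest
        # total distance to the rest of its cluster.
        new = [
            min(c, key=lambda m: sum(dist(m, q) for q in c)) if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
        if new == medoids:
            break
        medoids = new
    return sorted(medoids)
```

Unlike a K-Means centroid, a medoid is always one of the input pixels, so every prompt lies on the proposal itself even when the mask is ring-shaped or otherwise non-convex. If the points are passed to a prompt-based segmenter that expects (x, y) pixel coordinates, the (row, col) pairs returned here would need to be swapped first.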

Paper Structure

This paper contains 32 sections, 6 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the proposed ZISVFM methodology. This approach employs two vision foundation models: SAM (Kirillov et al., 2023) for segmentation and a ViT trained with DINOv2 (Oquab et al., 2023) for feature description in a scene. The process consists of three main stages: 1) generating object-agnostic mask proposals using SAM on colorized depth images; 2) refining object masks by removing non-object masks based on explicit visual representations from a self-supervised ViT; 3) further optimising object segmentation with point prompts derived from clustering centres within each object's proposal.
  • Figure 2: Visualization of self-attention maps obtained with the six different heads in the last attention layer from ViT.
  • Figure 3: Comparison of ZISVFM with baseline and SOTA methods on the OCID and HIOD datasets. The baseline method, SAM, uses RGB and depth images as inputs. The SOTA methods comprise two representative UOIS methods, UOIS-Net-3D and MSMFormer, with the latter incorporating a zoom-in cluster refinement operation. Compared with all baseline and SOTA methods, the proposed ZISVFM provides clear and precise masks in the hierarchical scenes of HIOD.
  • Figure 4: Ablation studies examining the sensitivity of model performance metrics relative to threshold parameter $\tau$ for both object and boundary detection.
  • Figure 5: Common failure cases of our proposed method on the OCID (Suchi et al., 2019) and OSD (Richtsfeld et al., 2012) datasets. The segmentation results are not refined using point prompts.
  • ...and 1 more figures
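
The multi-head attention maps in Figure 2 and the attention-entropy weighting mentioned in the TL;DR suggest a filtering step along the following lines. This is only a sketch under stated assumptions: the weighting rule (one minus normalised Shannon entropy), the mask score, and the names `entropy`, `fuse_heads`, and `mask_score` are hypothetical and are not taken from the paper. Attention maps and masks are flattened to plain lists, and each map is assumed to have a positive sum and more than one element.

```python
import math

def entropy(attn):
    """Shannon entropy (in nats) of a non-negative attention map."""
    total = sum(attn)
    return -sum((a / total) * math.log(a / total) for a in attn if a > 0)

def fuse_heads(heads):
    """Fuse per-head maps so that focused (low-entropy) heads weigh more.

    Assumption: weight = 1 - H / log(N), normalised across heads, with a
    fallback to uniform weights when every head is maximally diffuse.
    """
    n = len(heads[0])
    raw = [max(0.0, 1.0 - entropy(h) / math.log(n)) for h in heads]
    z = sum(raw)
    weights = [w / z for w in raw] if z > 0 else [1.0 / len(heads)] * len(heads)
    fused = [0.0] * n
    for w, h in zip(weights, heads):
        total = sum(h)
        for i, a in enumerate(h):
            fused[i] += w * a / total  # each head is normalised to sum to 1
    return fused, weights

def mask_score(mask, fused):
    """Mean fused attention inside the mask, scaled so uniform attention gives 1.0."""
    inside = [f for m, f in zip(mask, fused) if m]
    if not inside:
        return 0.0
    return len(fused) * sum(inside) / len(inside)
```

Under this sketch, a proposal whose `mask_score` falls below a chosen cut-off would be discarded as a non-object mask. The cut-off is a free parameter here and should not be read as the threshold $\tau$ studied in Figure 4.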