SAI3D: Segment Any Instance in 3D Scenes

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, Baoquan Chen

TL;DR

SAI3D is a zero-shot 3D instance segmentation approach that combines geometric priors with semantic cues derived from the Segment Anything Model (SAM). It outperforms existing open-vocabulary baselines and even surpasses fully supervised methods in class-agnostic segmentation on ScanNet++.

Abstract

Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same category and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from the Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which substantially improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet, Matterport3D, and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully supervised methods in class-agnostic segmentation on ScanNet++. Our project page is at https://yd-yin.github.io/SAI3D.
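
The progressive merging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `region_affinity` and `progressive_region_growing`, the size-weighted averaging of affinities, and the high-to-low threshold schedule are all assumptions made for illustration. The sketch also assumes that pairs of primitives that are not spatially adjacent are given an affinity of 0 in the input matrix.

```python
import numpy as np


def region_affinity(affinity, sizes, members_a, members_b):
    """Size-weighted mean affinity between two groups of primitives (assumed rule)."""
    sub = affinity[np.ix_(members_a, members_b)]
    weights = np.outer(sizes[members_a], sizes[members_b])
    return float((sub * weights).sum() / weights.sum())


def progressive_region_growing(affinity, sizes, thresholds):
    """Greedily merge primitives while sweeping a threshold from strict to loose.

    affinity:   (n, n) symmetric matrix of pairwise scores in [0, 1]
    sizes:      (n,) positive weights, e.g. point counts per primitive
    thresholds: decreasing sequence of merge thresholds (hypothetical schedule)
    Returns an (n,) array of region labels.
    """
    n = affinity.shape[0]
    labels = np.arange(n)
    for tau in thresholds:
        merged = True
        while merged:
            merged = False
            region_ids = np.unique(labels)
            best_score, best_pair = tau, None
            # Find the most similar pair of regions at the current level.
            for i, a in enumerate(region_ids):
                members_a = np.flatnonzero(labels == a)
                for b in region_ids[i + 1:]:
                    members_b = np.flatnonzero(labels == b)
                    score = region_affinity(affinity, sizes, members_a, members_b)
                    if score >= best_score:
                        best_score, best_pair = score, (a, b)
            # Merge that pair, then re-evaluate region-level affinities.
            if best_pair is not None:
                a, b = best_pair
                labels[labels == b] = a
                merged = True
    return labels
```

Because affinities are re-aggregated over whole regions after every merge, a single spurious high score between two small primitives carries less weight once the regions around them have grown. This is one plausible reading of why merging at multiple levels is more robust than a single fixed threshold.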

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: SAI3D: A Zero-Shot Approach for 3D Instance Segmentation. SAI3D leverages geometric priors and 2D segmentation foundation models to perform training-free zero-shot 3D instance segmentation (top). Our generated 3D masks enable applications of open-vocabulary queries of fine-grained 3D instances (bottom).
  • Figure 2: Method overview. Our approach combines geometric priors with the capabilities of 2D foundation models. We over-segment 3D point clouds into superpoints (top-left), and generate 2D image masks using SAM (bottom-left). We then construct a scene graph that quantifies the pairwise affinity scores of superpoints (middle). Finally, we use progressive region growing to gradually merge 3D superpoints into the final 3D instance segmentation masks (right).
  • Figure 3: 3D-2D projections. Affinity scores for 3D primitives are derived by projecting them onto multi-view 2D masks. In the provided example, accurate masks (first row) confirm object unity, like parts of a stool, while incorrect masks (second row) introduce noise in affinity assessment. Points occluded in images (third row) or outside image boundaries (fourth row) are excluded from affinity score calculations to ensure segmentation accuracy.
  • Figure 4: Multi-level merging criteria. (Left) Compared with vanilla region growing, which accumulates merging errors as regions grow, our approach achieves better results using multi-level merging criteria. (Right) The vanilla algorithm mistakenly merges the entire table with the ground, triggered merely by an incorrect affinity between tiny segments of the table leg and the ground.
  • Figure 5: Different thresholding strategies. Without dynamic thresholding, the segmentation results can be sensitive to the manually chosen affinity threshold. A lower threshold is prone to under-segmentation, such as merging the television with the wall (first column). Conversely, a higher threshold may result in over-segmentation, breaking objects into messy parts (third column). Our progressive thresholding adjusts the threshold dynamically as merging proceeds, thus contributing to more robust and accurate segmentation.
  • ...and 5 more figures
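
The affinity computation outlined in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's formulation: the helper names `mask_histogram` and `superpoint_affinity`, the use of cosine similarity between per-view mask histograms, and the plain averaging over views are assumptions. The only behavior taken from the caption is that points occluded in an image or outside its boundaries are excluded; here they are marked with the mask ID -1.

```python
import numpy as np


def mask_histogram(mask_ids, num_masks):
    """Normalized histogram of 2D mask IDs over the visible points of one superpoint.

    mask_ids: integer array, one entry per 3D point; -1 marks a point that is
              occluded or projects outside the image. Valid IDs are assumed to
              lie in [0, num_masks).
    Returns None when no point of the superpoint is visible in this view.
    """
    visible = mask_ids[mask_ids >= 0]
    if visible.size == 0:
        return None
    hist = np.bincount(visible, minlength=num_masks).astype(float)
    return hist / hist.sum()


def superpoint_affinity(views_a, views_b, num_masks):
    """Average per-view similarity of two superpoints (illustrative scoring rule).

    views_a[k] and views_b[k] hold the per-point mask IDs of the two
    superpoints in view k. Views where either superpoint is invisible are skipped.
    """
    scores = []
    for ids_a, ids_b in zip(views_a, views_b):
        hist_a = mask_histogram(ids_a, num_masks)
        hist_b = mask_histogram(ids_b, num_masks)
        if hist_a is None or hist_b is None:
            continue  # no usable evidence from this view
        cosine = hist_a @ hist_b / (np.linalg.norm(hist_a) * np.linalg.norm(hist_b))
        scores.append(float(cosine))
    return float(np.mean(scores)) if scores else 0.0
```

Two superpoints that fall in the same mask in a view score 1 for that view, and two that fall in different masks score 0. Averaging over many views is one simple way to dampen the noise from the occasional incorrect mask.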