SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Chaolei Wang, Yang Luo, Jing Du, Siyu Chen, Yiping Chen, Ting Han

TL;DR

SGS-3D tackles the pervasive errors in 3D instance segmentation arising from 2D-to-3D lifting by introducing a training-free split-then-grow refinement that fuses semantic and geometric cues. The method employs an occlusion-aware point-image mapping, co-occurrence-driven 2D mask filtering, and a semantic-guided aggregation pipeline with density-based splitting, feature-guided growing, and multi-view merging. It achieves state-of-the-art performance among training-free approaches on ScanNet200, ScanNet++, and KITTI-360, with notable robustness in depth-less outdoor environments, and enables open-set 3D understanding when combined with vision-language models. The approach provides a practical, generalizable bridge between 2D semantic foundations and 3D geometry for high-fidelity, class-agnostic 3D instance segmentation.

Abstract

Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, approaches to 3D instance segmentation based on 2D-to-3D lifting struggle to produce precise instance-level segmentation, because errors accumulate during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic masks for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that rely directly on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometric primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in cases of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available at https://github.com/wangchaolei7/SGS-3D.
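
To make the "split" half of split-then-grow concrete, the sketch below breaks one lifted mask into spatially connected groups of points. It is an illustration only and not the authors' implementation: the function name `split_by_continuity`, the `radius` parameter, and the toy coordinates are all assumptions, and plain radius-based connectivity stands in for whatever geometric primitives and density criterion the paper actually uses.

```python
from collections import deque
from math import dist


def split_by_continuity(points, radius):
    """Group 3D points into spatially connected components.

    Hypothetical helper: two points are linked when they lie within
    `radius` of each other, and each connected component becomes one
    candidate seed. Returns sorted lists of point indices.
    """
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        start = unvisited.pop()
        queue = deque([start])
        component = [start]
        while queue:
            i = queue.popleft()
            # Collect neighbours first so the set is not mutated mid-iteration.
            near = [j for j in unvisited if dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                component.append(j)
        components.append(sorted(component))
    return sorted(components)


# Toy lifted mask that wrongly covers two separate objects (assumed data).
lifted_mask_points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),  # object A
    (2.0, 0.0, 0.0), (2.1, 0.0, 0.0),                    # object B
]
seeds = split_by_continuity(lifted_mask_points, radius=0.15)
print(seeds)
```

On this toy input the mask separates into two groups, one per object. The neighbour search here is quadratic in the number of points; a real point cloud would need a spatial index, which is omitted to keep the sketch short.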

Paper Structure

This paper contains 23 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: SGS-3D: High-fidelity segmentation by overcoming ambiguous 2D-to-3D lifting. Previous methods suffer from flawed instance grouping (a), caused by ambiguous 2D semantics and inadequately handled occlusions during projection. SGS-3D overcomes this by establishing reliable 3D semantics via occlusion-aware mapping and a novel "split-then-grow" refinement (b). This dual-refinement strategy achieves state-of-the-art accuracy across diverse scenes, especially in challenging, depth-less outdoor environments (c).
  • Figure 2: Overview of the training-free SGS-3D pipeline. Our method begins by computing robust, occlusion-aware point-image mappings without requiring ground-truth depth (§ 3.1). In the 2D Mask Proposal stage (§ 3.2), these mappings guide a Co-occurrence Mask Filtering process to prune ambiguous candidate masks, yielding a set of reliable 2D masks. These are then lifted to 3D and fed into the Feature-Guided Aggregation stage (§ 3.3), which first uses Spatial Continuity Splitting to generate pure semantic-geometric seeds. Subsequently, Feature-Guided Growing expands these seeds into complete instances, which are finally consolidated via Multi-View Progressive Merging to produce the final, high-fidelity object instances.
  • Figure 3: Our method constructs valid visibility mapping for each point. While depth sensors struggle on textureless and highly reflective surfaces, our approach remains effective.
  • Figure 4: Co-occurrence mask filtering strategy. Co-occurrence scores between superpoints are constructed from accurate 2D mask sets (a-c). Over-segmented (d) and under-segmented (e) masks, which exhibit low scores, are then removed from the candidate mask list. Image sets with $\mathcal{P}_{vis,m}^j = 0$ (f) are excluded from the calculation.
  • Figure 5: Semantic-guided aggregation. Within a single view, a Semantic-geometric Seed (SS) (orange) is expanded into a Super Semantic Seed (SSS) by merging with neighboring superpoints (shown in other colors). Subsequently, the proposals from different views are progressively merged to form the final object instance.
  • ...and 3 more figures
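
The co-occurrence filtering idea in the Figure 4 caption can be illustrated with a small sketch. The scoring rule here is an assumption, since the caption does not give the paper's formula: the co-occurrence of two superpoints is taken as the fraction of views seeing both in which a single 2D mask covers both, and a mask's score is the mean co-occurrence over its superpoint pairs. The names `cooccurrence`, `mask_score`, the `views` layout, and the 0.8 threshold are all hypothetical.

```python
from itertools import combinations


def cooccurrence(views, a, b):
    """Fraction of views seeing both superpoints where one mask holds both."""
    together = seen = 0
    for view in views:
        if a not in view["visible"] or b not in view["visible"]:
            continue  # views that do not see the pair are left out
        seen += 1
        together += any(a in m and b in m for m in view["masks"])
    return together / seen if seen else 0.0


def mask_score(views, mask):
    """Mean pairwise co-occurrence of the superpoints a mask covers."""
    pairs = list(combinations(sorted(mask), 2))
    if not pairs:
        return 1.0  # a single-superpoint mask has no pair to disagree on
    return sum(cooccurrence(views, a, b) for a, b in pairs) / len(pairs)


# Toy scene (assumed data): superpoints 0-1 form one object, 2-3 another.
views = [
    {"visible": {0, 1, 2, 3}, "masks": [{0, 1}, {2, 3}]},
    {"visible": {0, 1, 2, 3}, "masks": [{0, 1}, {2, 3}]},
    {"visible": {0, 1, 2, 3}, "masks": [{0, 1, 2, 3}]},  # under-segmented
    {"visible": {2, 3}, "masks": [{2, 3}]},
]

THRESHOLD = 0.8  # hypothetical cut-off
kept = [m for v in views for m in v["masks"] if mask_score(views, m) >= THRESHOLD]
print(kept)
```

In this toy scene the mask that merges both objects scores 5/9 and is dropped, while the masks that agree across views score 1.0 and are kept. This simplified score only penalizes under-segmented masks; detecting over-segmented masks, which the caption also mentions, would need an additional criterion not sketched here.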