Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking

Juan Wang, Yasutomo Kawanishi, Tomo Miyazaki, Zhijie Wang, Shinichiro Omachi

TL;DR

This work tackles the costly reliance on manual 3D annotations for instance segmentation by introducing a Granularity-Consistent automatic 2D Mask Tracking approach that preserves temporal correspondences across video frames. A three-stage curriculum (Stage 1: fragmented warm-up training on frame-independent pseudo labels; Stage 2: granularity-consistent multi-view supervision; Stage 3: full-scene fine-tuning) yields globally coherent 3D pseudo-labels distilled from 2D priors generated by foundation models such as SAM and SAM2. The method achieves state-of-the-art results on ScanNet++ and ScanNet200 for class-agnostic 3D segmentation, with real-time inference and notable open-vocabulary capabilities demonstrated via text-based object retrieval and long-tail category understanding. By enforcing cross-frame consistency and gradually exposing the model to higher-quality annotations, the approach robustly learns a unified 3D representation from initially fragmented 2D priors, enabling practical 3D scene understanding without manual labeling.

Abstract

3D instance segmentation is an important task for real-world applications. To avoid costly manual annotations, existing methods have explored generating pseudo labels by transferring 2D masks from foundation models to 3D. However, this approach is often suboptimal because the video frames are processed independently. This causes inconsistent segmentation granularity and conflicting 3D pseudo labels, which degrade the accuracy of the final segmentation. To address this, we introduce a Granularity-Consistent automatic 2D Mask Tracking approach that maintains temporal correspondences across frames, eliminating conflicting pseudo labels. Combined with a three-stage curriculum learning framework, our approach progressively trains from fragmented single-view data to unified multi-view annotations and, ultimately, to globally coherent full-scene supervision. This structured learning pipeline progressively exposes the model to pseudo-labels of increasing consistency. Thus, we can robustly distill a consistent 3D representation from initially fragmented and contradictory 2D priors. Experimental results demonstrate that our method effectively generates consistent and accurate 3D segmentations. Furthermore, the proposed method achieves state-of-the-art results on standard benchmarks and exhibits open-vocabulary ability.
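
To make the notion of "conflicting 3D pseudo labels" concrete, the sketch below counts how often two frames disagree on whether a pair of 3D points belongs to the same instance. This is only an illustration: the function name `pairwise_conflict_rate`, the dictionary representation of labels, and the conflict measure itself are assumptions made here, not quantities defined in the paper.

```python
from itertools import combinations


def pairwise_conflict_rate(labels_a, labels_b):
    """Fraction of point pairs on which two frames disagree about grouping.

    labels_a / labels_b map a 3D point index to the mask id that one frame
    assigned to it. Only points labeled by both frames are compared.
    (Hypothetical measure, used here purely for illustration.)
    """
    shared = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    # A pair conflicts when one frame groups the two points together
    # and the other frame separates them.
    conflicts = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return conflicts / len(pairs)


# Frame A sees the chair as one mask; frame B splits it into seat and back.
frame_a = {0: "chair", 1: "chair", 2: "chair", 3: "chair"}
frame_b = {0: "seat", 1: "seat", 2: "back", 3: "back"}
rate = pairwise_conflict_rate(frame_a, frame_b)
```

In this toy case the two frames disagree on four of the six point pairs, the kind of granularity mismatch that independent per-frame processing can produce for the same object.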

Paper Structure

This paper contains 33 sections, 10 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of pseudo label generation between our method and the existing class-agnostic 3D instance segmentation approach Segment3D [huang2024segment3d]. (a) Input RGB-D video frames from an indoor scene in ScanNet [dai2017scannet], showing the same chair from different viewpoints. (b) Segment3D [huang2024segment3d] applies Automatic-SAM [kirillov2023segment] to individual 2D frames to generate frame-specific masks, resulting in inconsistent segmentation granularity. For example, the chair is segmented at different levels of detail across frames, producing conflicting pseudo-labels in 3D space. (c) Our method incorporates a cross-frame consistent segmentation module that tracks objects across 2D frames, ensuring consistent segmentation granularity throughout the video sequence. This produces masks with unified segmentation boundaries across frames, leading to coherent 3D pseudo-labels, as the 3D results for the chair show.
  • Figure 2: An overview of the proposed method. We propose a Granularity-Consistent Segmentation Policy with a three-stage curriculum learning pipeline for class-agnostic 3D instance segmentation. Stage 1: From input RGB-D video sequences, we apply SAM-Based Mask Generation to extract initial 2D masks $\mathcal{M}_{t_k}$ on keyframes $\{t_k\}$, which are then projected to 3D space as frame-independent pseudo labels for fragmented warm-up training of model $\text{Net}_1$. Stage 2: Our Granularity-Consistent Segmentation Policy generates 2D masks $\mathcal{M}_{t}^{consistent}$, which are projected as granularity-consistent pseudo labels across all frames and used to fine-tune the model, yielding $\text{Net}_2$. Stage 3: We fine-tune the model on full point clouds with confidence-based filtering to achieve globally coherent class-agnostic 3D instance segmentation, yielding the final model $\text{Net}_3$.
  • Figure 3: Object Status Transitions. Our state management system handles three object states: Active (currently tracked), Dormant (temporarily lost), and Terminated (permanently removed). Transitions are governed by the IoU matching threshold $\tau_{IoU}$, the dormancy counter $D_{count}$, and the dormancy threshold $\tau_{dorm}$, enabling robust tracking across temporary occlusions and viewpoint changes.
  • Figure 4: Qualitative comparison of ScanNet++'s ground truth, Segment3D [huang2024segment3d], and ours.
  • Figure 5: Comparison of 3D text retrieval results between our method, Segment3D, and the fully-supervised OpenMask3D. First row: Segmentation results for the 'bottled water' query in an office scene containing three locations. Our method successfully identifies all instances, Segment3D detects only the second location, and OpenMask3D identifies the first two locations but misclassifies Coca-Cola as bottled water. Second row: Segmentation results for the 'green comforter' query in a bedroom scene, where our method achieves the most precise segmentation boundaries.
  • ...and 2 more figures
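
The object status transitions named in the Figure 3 caption (Active, Dormant, Terminated, governed by $\tau_{IoU}$, $D_{count}$, and $\tau_{dorm}$) can be sketched as a small state machine. The caption does not spell out the exact transition rules, so the rules below are assumptions: an object is Active when its best IoU with a mask in the current frame reaches `tau_iou`, becomes Dormant otherwise, and is Terminated once its dormancy counter exceeds `tau_dorm`. All names, default thresholds, and the set-of-pixels mask representation are hypothetical, and `tracked_mask` stands in for whatever mask the tracker propagates into the current frame.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"          # currently tracked
    DORMANT = "dormant"        # temporarily lost
    TERMINATED = "terminated"  # permanently removed


@dataclass
class TrackedObject:
    status: Status = Status.ACTIVE
    dormancy_count: int = 0  # D_count: consecutive frames without a match


def mask_iou(a: set, b: set) -> float:
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def update_status(obj, tracked_mask, frame_masks, tau_iou=0.5, tau_dorm=3):
    """Advance one object's status by one frame (assumed rules, not the paper's)."""
    if obj.status is Status.TERMINATED:
        return obj  # terminated objects are never revived in this sketch
    best = max((mask_iou(tracked_mask, m) for m in frame_masks), default=0.0)
    if best >= tau_iou:
        # Matched: the object is (re)activated and its counter is reset.
        obj.status, obj.dormancy_count = Status.ACTIVE, 0
    else:
        # Unmatched: count the miss, then decide between Dormant and Terminated.
        obj.dormancy_count += 1
        obj.status = (Status.TERMINATED if obj.dormancy_count > tau_dorm
                      else Status.DORMANT)
    return obj
```

Under these assumed rules, a Dormant object that is matched again before its counter exceeds `tau_dorm` returns to Active, which is one way to bridge the temporary occlusions and viewpoint changes the caption mentions.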