OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction

Hongyang Li, Jinyuan Qu, Lei Zhang

TL;DR

OVSeg3R tackles open-vocabulary 3D instance segmentation by using 3D reconstruction and 2D segmentation models to generate supervision, avoiding costly manual 3D annotation. It introduces View-wise Instance Partition (VIP), which partitions scene-level predictions to their respective views so that partial, view-specific annotations can supervise them, and 2D Instance Boundary-aware Superpoints (IBSp), which preserve instance boundaries during superpoint clustering. By coupling SegDINO3D-VL with a CLIP-based text encoder and using reconstruction-derived correspondences, OVSeg3R improves open-vocabulary generalization and reaches state-of-the-art performance: +2.3 mAP overall on ScanNet200 and about +7.1 mAP on novel classes under the standard open-vocabulary setting. The method narrows the performance gap between tail and head classes, is robust to reconstruction quality, and remains compatible with traditional scene-level supervision, so both kinds of data can be used for open-vocabulary 3D understanding in real-world scenarios.

Abstract

In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts scenes reconstructed from 2D videos as input, avoiding costly manual adjustment while aligning the input with real-world applications. By exploiting the 2D-to-3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. Because these lifted annotations are partial, using them directly would incorrectly introduce false positives into the supervision. To avoid this, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative superpoints based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce the 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from crossing instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
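
The lifting and partition steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `pixel_to_point` correspondence map, and the convention that instance id 0 means background are all assumptions made here for illustration. The sketch lifts one view's 2D instance mask onto the reconstructed points, then restricts a scene-level predicted mask to that view's visible points so it can be compared only against that view's partial annotations.

```python
import numpy as np


def lift_view_masks(pixel_to_point, instance_mask, num_points):
    """Lift one view's 2D instance mask onto the reconstructed points.

    pixel_to_point: (H, W) int array, index of the 3D point each pixel maps
        to, or -1 where the pixel has no correspondence (assumed format).
    instance_mask: (H, W) int array of 2D instance ids, 0 = background.
    Returns (visible, labels), both of length num_points.
    """
    valid = pixel_to_point >= 0
    point_idx = pixel_to_point[valid]
    visible = np.zeros(num_points, dtype=bool)
    visible[point_idx] = True  # points seen by this view form its sub-scene
    labels = np.zeros(num_points, dtype=np.int64)
    # If several pixels hit the same point, the last write wins.
    labels[point_idx] = instance_mask[valid]
    return visible, labels


def partition_prediction(pred_mask, visible, min_points=1):
    """Restrict a scene-level predicted mask to one view's sub-scene.

    Returns the clipped mask, or None when the prediction has fewer than
    min_points points in this view, so that it is not penalized by a view
    that never observed it.
    """
    part = pred_mask & visible
    return part if part.sum() >= min_points else None


# Toy example: a 2x2 view over a 5-point scene.
pixel_to_point = np.array([[0, 1], [2, -1]])
instance_mask = np.array([[1, 1], [0, 2]])
visible, labels = lift_view_masks(pixel_to_point, instance_mask, num_points=5)
scene_pred = np.array([True, True, False, True, True])
view_pred = partition_prediction(scene_pred, visible)
```

In this toy case the prediction covers points 0, 1, 3, and 4, but only points 0 and 1 lie in the view, so only that part would be matched against the view's lifted labels.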

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 4 tables, 3 algorithms.

Figures (7)

  • Figure 1: (a) The traditional training scheme relies on costly manual effort and non-routine sensors, such as depth cameras, to construct training data and annotations. (b) OVSeg3R leverages modern 3D reconstruction models and well-studied 2D perception models to construct training data and annotations. To alleviate partial supervision and over-smoothing and thereby improve training stability, we use the 2D-3D correspondences from 3D reconstruction models to partition scene-level predictions for supervision, and leverage 2D instance masks to constrain the superpoints, assisting segmentation.
  • Figure 2: Training SegDINO3D-VL with OVSeg3R. Given an input video, we first apply 3D reconstruction and 2D instance segmentation to prepare the foundation data. The foundation data are then combined to construct both the input and the view-wise supervision for the 3D instance segmentation model SegDINO3D-VL. The reconstructed scene is fed into SegDINO3D-VL to produce scene-level instance segmentation results, which the view-wise instance partition module then partitions to each view for stable supervision.
  • Figure 3: (a) Visualization of the 2D instance masks obtained from well-studied 2D segmentation models and their corresponding lifted view-wise 3D instance segmentation annotations. (b) Visual comparison between superpoints built solely upon geometric continuity (geo.-only) and the proposed IBSp. Because reconstructed results are over-smoothed, geo.-only superpoints tend to merge geometrically less salient objects (picture, power outlet, carpet) into their background, preventing them from being segmented out. By incorporating 2D instance boundaries, at least one superpoint is preserved for each such object (highlighted by the red arrows), mitigating the issue.
  • Figure 4: Visualization of segmentation results of OVSeg3R on in-the-wild data. We provide the frames in which each object is most clearly visible in the video as references.
  • Figure 5: Input text prompt: "laptop . mouse . keyboard . power outlet .". Although the power outlet, keyboard, and mouse are not geometrically salient, making them difficult to identify even for humans in the reconstructed 3D point clouds, OVSeg3R can still accurately locate and segment them. For the laptop case, despite local reconstruction failures caused by inaccurate camera parameter estimation during reconstruction, OVSeg3R is still able to segment it (with green mask). Best viewed in the electronic version.
  • ...and 2 more figures
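
The boundary constraint illustrated in Figure 3(b) can be sketched as follows. This is a simplified illustration rather than the paper's IBSp procedure: it assumes that geometry-only superpoint ids and lifted 2D instance ids are already available per point as integer arrays, and the function name `boundary_aware_superpoints` is hypothetical. Each geometric superpoint is split wherever its points carry different instance ids, so no resulting superpoint straddles an instance boundary.

```python
import numpy as np


def boundary_aware_superpoints(geo_superpoint, instance_id):
    """Split geometry-only superpoints along lifted 2D instance boundaries.

    geo_superpoint: (N,) int array, superpoint id per point from geometric
        clustering (assumed input).
    instance_id: (N,) int array, lifted 2D instance id per point, with 0
        for points not covered by any mask (assumed convention).
    Returns an (N,) int array of new superpoint ids, one per distinct
    (geometric superpoint, instance) pair.
    """
    pairs = np.stack([geo_superpoint, instance_id], axis=1)
    _, new_ids = np.unique(pairs, axis=0, return_inverse=True)
    return new_ids.reshape(-1)


# A flat object (instance 3) that geometric clustering merged into the wall.
geo = np.array([0, 0, 0, 1, 1])
inst = np.array([0, 0, 3, 0, 0])
refined = boundary_aware_superpoints(geo, inst)
```

Here the third point, which belongs to instance 3, receives its own superpoint instead of staying merged with the first two, which mirrors how a power outlet or picture keeps at least one dedicated superpoint.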