OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction
Hongyang Li, Jinyuan Qu, Lei Zhang
TL;DR
OVSeg3R tackles open-vocabulary 3D instance segmentation by leveraging 3D reconstructions and 2D segmentation models to generate supervision without heavy per-scene 3D annotations. The approach introduces View-wise Instance Partition (VIP) to map partial, view-specific annotations to scene-level predictions and 2D Instance Boundary-aware Superpoints (IBSp) to preserve instance boundaries during superpoint clustering. By coupling SegDINO3D-VL with a CLIP-based text encoder and using reconstruction-derived correspondences, OVSeg3R improves open-vocabulary generalization and achieves state-of-the-art performance, notably +2.3 mAP overall on ScanNet200 and about +7.1 mAP on novel classes under the standard open-vocabulary setting. The method narrows the performance gap between tail and head classes, demonstrates robustness to reconstruction quality, and remains compatible with traditional scene-level supervision, enabling flexible data usage for open-vocabulary 3D understanding in real-world scenarios.
Abstract
In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly takes scenes reconstructed from 2D videos as input, avoiding costly manual adjustment while aligning the input with real-world applications. By exploiting the 2D-to-3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. Because each view annotates only part of the scene, supervising scene-level predictions against these partial annotations would wrongly treat correct predictions as false positives. To avoid this, we propose a View-wise Instance Partition algorithm, which partitions predictions into their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative superpoints based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from crossing instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
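The boundary constraint on superpoints described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each reconstructed point already carries a geometry-only superpoint id and a single 2D instance id projected from a mask (-1 where no mask covers the point). The function name `split_superpoints_by_instance` and both input lists are hypothetical. Points that share a geometric superpoint but fall in different 2D instances are split into separate superpoints, so no refined superpoint crosses an instance boundary.

```python
def split_superpoints_by_instance(superpoint_ids, instance_ids):
    """Split geometry-only superpoints wherever a 2D instance boundary cuts them.

    superpoint_ids[i]: geometric superpoint id of point i (assumed given).
    instance_ids[i]:   2D instance id projected onto point i, or -1 if the
                       point is not covered by any 2D mask (assumed given).
    Returns refined superpoint ids, numbered from 0 in order of first appearance.
    """
    if len(superpoint_ids) != len(instance_ids):
        raise ValueError("both inputs must have one entry per point")

    remap = {}    # (geometric superpoint, 2D instance) -> refined id
    refined = []
    for sp, inst in zip(superpoint_ids, instance_ids):
        key = (sp, inst)
        if key not in remap:
            remap[key] = len(remap)
        refined.append(remap[key])
    return refined


# Toy scene with six points: geometric superpoint 0 spans instances 5 and 7,
# for example a thin object lying on an over-smoothed surface.
geometric = [0, 0, 0, 0, 1, 1]
projected = [5, 5, 7, 7, 7, -1]
refined = split_superpoints_by_instance(geometric, projected)
print(refined)  # [0, 0, 1, 1, 2, 3]
```

In this toy case the geometric superpoint 0 is divided along the mask boundary, which is how a geometrically non-salient object can still receive its own superpoint. A full pipeline would additionally have to reconcile masks from multiple views, which this sketch leaves out.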
