DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting
Luis Wiedmann, Luca Wiehe, David Rozenberszki
TL;DR
DCSEG tackles open-set 3D semantic segmentation by decoupling class-agnostic 3D mask proposal, built on a 3D Gaussian Splatting scene representation, from semantic labeling. Stage 1 learns contrastive features on the Gaussians and clusters them with HDBSCAN into class-agnostic 3D masks; Stage 2 assigns semantic labels by matching these masks to 2D open-vocabulary masks with a Hungarian-based solver, without retraining the 3D representation. The approach reports competitive mIoU and mAcc on Replica and ScanNet, with strong tail-class performance, and swapping the 2D backbone (e.g., OVSeg vs. OpenSeg) demonstrates its modularity. The decoupled design allows future 2D or 3D components to be integrated and supports multi-object instance segmentation without additional training, which is relevant for robotics and AR/VR applications that need open-vocabulary 3D perception. Overall, DCSEG shows that combining explicit 3D geometry with decoupled semantic priors yields robust, flexible, and efficient 3D scene understanding.
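To make the Stage 2 matching concrete, here is a minimal sketch for a single view. It assumes the 3D clusters have already been rendered into 2D masks, and it scores each cluster against each labeled 2D mask by IoU before picking the one-to-one assignment with the highest total overlap. The function names and the set-of-pixel-indices mask format are hypothetical, and the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm, which only works for a handful of masks. The IoU cost is also an assumption; the summary above does not state the matching cost DCSEG uses.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_clusters_to_labels(cluster_masks, labeled_masks):
    """Assign each rendered 3D cluster mask at most one labeled 2D mask.

    cluster_masks: dict cluster_id -> set of pixel indices (hypothetical format)
    labeled_masks: list of (class_name, set of pixel indices)
    Returns dict cluster_id -> class_name for matched pairs with non-zero IoU.
    """
    ids = list(cluster_masks)
    n, m = len(ids), len(labeled_masks)
    # Slots with index >= m mean "no match", so a cluster may stay unlabeled.
    slots = list(range(m + n))
    best_score, best_perm = -1.0, None
    # Brute-force stand-in for the Hungarian solver: try every one-to-one
    # assignment and keep the one with the highest total IoU.
    for perm in permutations(slots, n):
        score = sum(
            iou(cluster_masks[ids[i]], labeled_masks[j][1])
            for i, j in enumerate(perm)
            if j < m
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return {
        ids[i]: labeled_masks[j][0]
        for i, j in enumerate(best_perm)
        if j < m and iou(cluster_masks[ids[i]], labeled_masks[j][1]) > 0
    }
```

For realistic mask counts, a polynomial-time solver such as SciPy's `linear_sum_assignment` applied to the negated IoU matrix would replace the permutation loop.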
Abstract
Open-set 3D segmentation represents a major point of interest for multiple downstream robotics and augmented/virtual reality applications. We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations as well as semantic segmentation foundation models. We first reconstruct a scene with 3D Gaussians and learn class-agnostic features through contrastive supervision from a 2D instance proposal network. These 3D features are then clustered to form coarse object- or part-level masks. Finally, we match each 3D cluster to class-aware masks predicted by a 2D open-vocabulary segmentation model, assigning semantic labels without retraining the 3D representation. Our decoupled design (1) provides a plug-and-play interface for swapping different 2D or 3D modules, (2) ensures multi-object instance segmentation at no extra cost, and (3) leverages rich 3D geometry for robust scene understanding. We evaluate on synthetic and real-world indoor datasets, demonstrating improved performance over comparable NeRF-based pipelines on mIoU and mAcc, particularly for challenging or long-tail classes. We also show how varying the 2D backbone affects the final segmentation, highlighting the modularity of our framework. These results confirm that decoupling 3D mask proposal and semantic classification can deliver flexible, efficient, and open-vocabulary 3D segmentation.
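The contrastive supervision mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss, so the following uses one common InfoNCE-style form as an assumption: features of pixels that fall inside the same 2D instance mask are pulled together, and features from different masks are pushed apart. The function name, the `temperature` value, and the list-based inputs are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def contrastive_loss(features, mask_ids, temperature=0.1):
    """InfoNCE-style loss: pixels in the same 2D mask attract, others repel.

    features: list of feature vectors (lists of floats), one per sampled pixel
    mask_ids: list of ints, the 2D instance mask each pixel falls into
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        # Exponentiated cosine similarity of pixel i to every pixel.
        sims = [
            math.exp(sum(a * b for a, b in zip(feats[i], feats[j])) / temperature)
            for j in range(n)
        ]
        # Normalizer over all other pixels (positives and negatives).
        denom = sum(s for j, s in enumerate(sims) if j != i)
        for j in range(n):
            if j != i and mask_ids[j] == mask_ids[i]:
                total += -math.log(sims[j] / denom)
                count += 1
    # Average over positive pairs; zero if no mask contains two samples.
    return total / count if count else 0.0
```

Under this sketch, features that already group by mask give a loss near zero, while features that mix masks give a large loss, which is the signal that makes the learned features clusterable into class-agnostic masks.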
