DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting

Luis Wiedmann, Luca Wiehe, David Rozenberszki

TL;DR

DCSEG tackles open-set 3D semantic segmentation by decoupling a 3D Gaussian Splatting–based scene representation from semantic labeling. Stage 1 creates class-agnostic 3D masks via contrastive features and HDBSCAN clustering; Stage 2 assigns semantic labels by matching these masks to 2D open-vocabulary masks with a Hungarian-based solver, all in a modular, retraining-free framework. The approach demonstrates competitive mIoU and mAcc on Replica and ScanNet, with strong tail-class performance and clear modularity when swapping 2D backbones (e.g., OVSeg vs OpenSeg). This decoupled design enables flexible integration of future 2D/3D components and supports multi-object instance segmentation without additional training, which is impactful for robotics and AR/VR applications needing open-vocabulary 3D perception. Overall, DCSEG shows that leveraging explicit 3D geometry with decoupled semantic priors yields robust, scalable 3D scene understanding with practical efficiency.
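The Stage 2 matching step can be illustrated with a small sketch. It assumes that each 3D cluster has been rendered into a 2D view as a set of pixel indices and that the matching score is plain IoU; both are assumptions made for illustration, not details taken from the paper. For brevity, an exhaustive search over one-to-one assignments stands in for the Hungarian solver. It finds the same optimum on small inputs but scales factorially. The names `iou` and `match_clusters_to_labels` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_clusters_to_labels(cluster_masks, class_masks):
    """Give each cluster a distinct class label so that total IoU is maximal.

    cluster_masks: dict cluster_id -> set of pixel indices (rendered 3D cluster)
    class_masks:   dict label -> set of pixel indices (2D open-vocabulary mask)
    Assumes there are at least as many class masks as clusters.
    Exhaustive search is used here in place of a Hungarian solver.
    """
    cluster_ids = sorted(cluster_masks)
    labels = sorted(class_masks)
    best_score, best_assignment = -1.0, {}
    for perm in permutations(labels, len(cluster_ids)):
        score = sum(
            iou(cluster_masks[c], class_masks[label])
            for c, label in zip(cluster_ids, perm)
        )
        if score > best_score:
            best_score, best_assignment = score, dict(zip(cluster_ids, perm))
    return best_assignment


# Toy data: two class-agnostic clusters and three class-aware 2D masks.
clusters = {0: {0, 1, 2, 3}, 1: {4, 5, 6}}
classes = {"bed": {0, 1, 2}, "floor": {4, 5, 6, 7}, "lamp": {8, 9}}
assignment = match_clusters_to_labels(clusters, classes)
print(assignment)
```

Because the labels come only from this matching step, swapping the 2D open-vocabulary model changes `class_masks` but leaves the 3D clusters untouched, which is the sense in which the framework is retraining-free.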

Abstract

Open-set 3D segmentation represents a major point of interest for multiple downstream robotics and augmented/virtual reality applications. We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations as well as semantic segmentation foundation models. We first reconstruct a scene with 3D Gaussians and learn class-agnostic features through contrastive supervision from a 2D instance proposal network. These 3D features are then clustered to form coarse object- or part-level masks. Finally, we match each 3D cluster to class-aware masks predicted by a 2D open-vocabulary segmentation model, assigning semantic labels without retraining the 3D representation. Our decoupled design (1) provides a plug-and-play interface for swapping different 2D or 3D modules, (2) ensures multi-object instance segmentation at no extra cost, and (3) leverages rich 3D geometry for robust scene understanding. We evaluate on synthetic and real-world indoor datasets, demonstrating improved performance over comparable NeRF-based pipelines on mIoU and mAcc, particularly for challenging or long-tail classes. We also show how varying the 2D backbone affects the final segmentation, highlighting the modularity of our framework. These results confirm that decoupling 3D mask proposal and semantic classification can deliver flexible, efficient, and open-vocabulary 3D segmentation.
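For readers unfamiliar with the two reported metrics, the following minimal sketch computes mIoU and mAcc from per-point predicted and ground-truth labels. It assumes that classes absent from the ground truth are skipped when averaging; the paper's exact evaluation protocol is not given here, and the function name `miou_and_macc` is hypothetical.

```python
def miou_and_macc(pred, gt, classes):
    """Return (mean IoU, mean per-class accuracy) over classes present in gt."""
    ious, accs = [], []
    for c in classes:
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        if tp + fn == 0:
            # Assumption: a class with no ground-truth points is ignored.
            continue
        ious.append(tp / (tp + fp + fn))  # per-class IoU
        accs.append(tp / (tp + fn))       # per-class accuracy (recall)
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy labels for six points.
gt = ["wall", "wall", "floor", "floor", "lamp", "lamp"]
pred = ["wall", "wall", "floor", "wall", "lamp", "floor"]
miou, macc = miou_and_macc(pred, gt, ["wall", "floor", "lamp"])
print(miou, macc)
```

Both metrics average over classes rather than over points, so a rare class such as "lamp" weighs as much as a dominant one such as "wall". This is why they are sensitive to long-tail performance.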

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Decoupling the semantic segmentation pipeline. We present DCSEG, a holistic 3D reconstruction and scene understanding method. At its core, we use pre-trained 2D foundation models to recognize uniform semantic concepts in 2D images of 3D scenes, and use the masks predicted on multi-view images as contrastive optimization targets for learning features of class-agnostic 3D instances and object parts. These features are then used to cluster the Gaussians in 3D with hierarchical clustering methods. In parallel, we use a 2D semantic segmentation network to obtain class-aware masks and aggregate the class-agnostic parts into meaningful semantic instances. As a result, we obtain 2D/3D instance and semantic segmentation on synthetic and real-world scenes.
  • Figure 2: Segmentation results of our method (DCSEG) compared to the ground truth and OpenNeRF. Our segmentation masks capture boundaries more accurately, e.g., around the blanket/pillows or the wall behind the bed-lamps. Large uniform areas, such as the floor, are segmented with significantly less noise. Switching between OpenSeg and OVSeg requires no retraining, demonstrating adaptability with respect to foundation models.
  • Figure 3: Shortcomings of the ScanNet ground truth (GT). Our method accurately recognizes and segments the posters on the wall, but they are not represented in the provided ScanNet GT, which lowers our measured performance despite a more accurate segmentation of the scene.
  • Figure 4: Further Results on ScanNet (scene0030_01) and Replica (office2)
  • Figure 5: Visualization of class-agnostic masks. Our mask proposal tends to propose instances, as demonstrated by the three separately identified towels and two armchair instances.
  • ...and 2 more figures