
Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation

Myrna C. Silva, Mahtab Dahaghin, Matteo Toso, Alessio Del Bue

TL;DR

This work addresses the difficulty of obtaining reliable 3D scene segmentation with limited 3D annotations by embedding a 3D segmentation feature field into a 3D Gaussian Splatting representation. It introduces a contrastive clustering objective on rendered 3D features and a spatial-similarity regularization to learn cross-view-consistent segmentation from inconsistent 2D masks, enabling both 2D novel-view segmentation and 3D scene partitioning. The approach outperforms state-of-the-art methods on open-vocabulary segmentation benchmarks, achieving higher mIoU and boundary alignment while maintaining real-time rendering capabilities. This yields a practical, scalable path for 3D scene understanding in cluttered or open-domain environments, with potential for integration with language-enabled prompts and hierarchical segmentation in future work.

Abstract

We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it before $\alpha$-blending their colors. Following this example, we train a model that also includes a segmentation feature vector for each Gaussian. These vectors can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors, and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks and still learn to generate segmentation masks that are consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
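To make the $\alpha$-blending of segmentation features concrete, the following is a minimal sketch of standard front-to-back compositing for a single pixel, applied to feature vectors instead of colors. It is not the authors' implementation: the function name `alpha_blend_features`, the list-based inputs, and the assumption that the Gaussians are already sorted from near to far with known per-pixel opacities are all illustrative.

```python
from typing import List, Sequence


def alpha_blend_features(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
) -> List[float]:
    """Composite per-Gaussian feature vectors for one pixel, front to back.

    features: feature vectors of the Gaussians covering the pixel,
              assumed sorted from nearest to farthest (hypothetical input).
    alphas:   effective opacity of each Gaussian at this pixel, in [0, 1].
    """
    if len(features) != len(alphas):
        raise ValueError("features and alphas must have the same length")

    dim = len(features[0]) if features else 0
    blended = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed by closer Gaussians

    for feature, alpha in zip(features, alphas):
        weight = alpha * transmittance  # contribution of this Gaussian
        for k in range(dim):
            blended[k] += weight * feature[k]
        transmittance *= 1.0 - alpha  # attenuate everything behind it

    return blended


if __name__ == "__main__":
    # Two Gaussians with orthogonal 2D features and equal opacity.
    print(alpha_blend_features([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
```

The same weights that blend colors in 3DGS blend the features here, so a rendered feature map stays aligned with the rendered image. A fully opaque Gaussian in front hides the features of everything behind it.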

Paper Structure

This paper contains 23 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The objective of Contrastive Gaussian Clustering is to take (a) a set of input images and (b) their independent segmentation masks, and (c) distill their information into a model based on 3DGS. This model can then be used for (d) a wide range of visual and segmentation downstream tasks, such as novel view synthesis, retrieving the mask of a selected object, or 3D scene segmentation.
  • Figure 2: Pipeline: (a) Given a set of images from different viewpoints, we use (b) a foundation model for image segmentation to generate 2D segmentation masks. We capture the appearance of the scene via (c) a rendering loss that, as in traditional 3DGS, optimizes the geometry and color of (d) our Contrastive Gaussian Clustering model. Simultaneously, (e) a contrastive loss on the rendered features learns a 3D segmentation feature field, which is also encoded in (d) our scene model. Moreover, we use (f) a spatial-similarity regularization mechanism that encourages the segmentation features to be similar for neighboring Gaussians and different for faraway Gaussians.
  • Figure 3: Qualitative comparison of test views for scenes from the LERF-Mask dataset. Our method is able to generate accurate instance segmentation masks for any object in in-the-wild scenes. We replicate and exceed the results in green apple, pork belly, apple. LangSplat exhibits a noisy segmentation mask for old-camera and a coarse segmentation for sheep. Gaussian Grouping misclassifies some pixels outside yellow bowl and pork belly (marked with a blue circle), or assigns two objects to the same category in waving basket.
  • Figure 4: In this experiment, we extract the Gaussians that belong to the red toy chair. We first compute a discriminative feature, following the same procedure as described in Section \ref{subsubsec:instance_seg}. Then, we filter the 3D Gaussians by computing their similarity scores. The final result is the 3D segmentation of the red toy chair. Observe that without our spatial-similarity regularization loss, the 3D segmentation is affected by a high number of outliers. Though these outliers can be easily removed by modifying the similarity threshold, we point out that, for a fixed similarity threshold, the number of outliers is lowest when we use our spatial-similarity regularization loss.
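
The filtering step described for Figure 4 can be sketched as a similarity threshold over per-Gaussian features. This is a hedged illustration only: the caption does not name the similarity measure, so cosine similarity is an assumption here, and the names `cosine_similarity`, `select_gaussians`, and the default threshold of 0.8 are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_gaussians(
    gaussian_features: Sequence[Sequence[float]],
    query_feature: Sequence[float],
    threshold: float = 0.8,  # hypothetical default, not taken from the paper
) -> List[int]:
    """Return indices of Gaussians whose feature is similar enough to the query.

    gaussian_features: one segmentation feature vector per 3D Gaussian.
    query_feature:     discriminative feature of the selected object.
    """
    return [
        index
        for index, feature in enumerate(gaussian_features)
        if cosine_similarity(feature, query_feature) >= threshold
    ]


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
    query = [1.0, 0.0]
    print(select_gaussians(features, query, threshold=0.8))
    print(select_gaussians(features, query, threshold=0.7))
```

Lowering the threshold admits more borderline Gaussians, which matches the caption's observation that the outlier count depends on the chosen threshold. In this picture, a feature field with tighter clusters leaves fewer Gaussians near the decision boundary.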