Table of Contents
Fetching ...

OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies

Runnan Chen, Xiangyu Sun, Zhaoqing Wang, Youquan Liu, Jiepeng Wang, Lingdong Kong, Jiankang Deng, Mingming Gong, Liang Pan, Wenping Wang, Tongliang Liu

TL;DR

OVGaussian tackles open-vocabulary 3D scene understanding with a Gaussian-based representation by learning per-Gaussian semantic vectors that can be rendered into multi-view 2D semantic maps. The framework introduces Generalizable Semantic Rasterization (GSR) to produce view-consistent semantic vectors across scenes and Cross-modal Consistency Learning (CCL) to align 3D semantics with text embeddings and dense 2D semantics, enabling cross-scene and cross-domain generalization. A large SegGaussian dataset of 288 Gaussian scenes with semantic and instance annotations supports open-vocabulary training and evaluation. Empirical results show state-of-the-art performance on cross-scene, open-vocabulary, novel-view, and cross-domain metrics, demonstrating robust semantic generalization in diverse 3D scenes. The work provides practical resources (SegGaussian and code) and offers a scalable path toward open-vocabulary 3D perception in robotics, AR, and autonomous systems.

Abstract

Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting the capabilities of open-vocabulary querying within their training scenes so that lacking the generalizability to novel scenes. In this work, we propose \textbf{OVGaussian}, a generalizable \textbf{O}pen-\textbf{V}ocabulary 3D semantic segmentation framework based on the 3D \textbf{Gaussian} representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed \textbf{SegGaussian}, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. In the next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).

OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies

TL;DR

OVGaussian tackles open-vocabulary 3D scene understanding with a Gaussian-based representation by learning per-Gaussian semantic vectors that can be rendered into multi-view 2D semantic maps. The framework introduces Generalizable Semantic Rasterization (GSR) to produce view-consistent semantic vectors across scenes and Cross-modal Consistency Learning (CCL) to align 3D semantics with text embeddings and dense 2D semantics, enabling cross-scene and cross-domain generalization. A large SegGaussian dataset of 288 Gaussian scenes with semantic and instance annotations supports open-vocabulary training and evaluation. Empirical results show state-of-the-art performance on cross-scene, open-vocabulary, novel-view, and cross-domain metrics, demonstrating robust semantic generalization in diverse 3D scenes. The work provides practical resources (SegGaussian and code) and offers a scalable path toward open-vocabulary 3D perception in robotics, AR, and autonomous systems.

Abstract

Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting the capabilities of open-vocabulary querying within their training scenes so that lacking the generalizability to novel scenes. In this work, we propose \textbf{OVGaussian}, a generalizable \textbf{O}pen-\textbf{V}ocabulary 3D semantic segmentation framework based on the 3D \textbf{Gaussian} representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed \textbf{SegGaussian}, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. In the next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).
Paper Structure (42 sections, 8 equations, 16 figures, 2 tables)

This paper contains 42 sections, 8 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: We introduce OVGaussian, a novel approach that extends Gaussian-based representations for open-vocabulary semantic generalization across scenes. Unlike previous methods (upper part: (a)$\rightarrow$(b)) that restrict open-vocabulary querying to specific trained scenes, OVGaussian (lower part: (c)$\rightarrow$(d)) is trained on the SegGaussian dataset, enabling it to directly predict semantic property for each Gaussian in novel scenes, thereby achieving cross-scene open-vocabulary query.
  • Figure 2: Illustration of the OVGaussian framework. Our approach combines Generalizable Semantic Rasterization (GSR) to predict semantic properties for 3D Gaussians and Cross-modal Consistency Learning (CCL) to align these properties with vocabulary embeddings and 2D visual features. Trained on the SegGaussian dataset, OVGaussian enables cross-scene, open-vocabulary segmentation, achieving robust semantic generalization across diverse 3D scenes and viewpoints.
  • Figure 3: Illustration of the adapter network. The adapter network takes voxel features and Gaussian attributes (rotation, color, scaling, and opacity) as inputs, processes them through a series of linear transformations and attention-based operations, and outputs a refined semantic vector for each Gaussian. This network enables effective multi-granularity fusion, capturing both local and global semantic information for each Gaussian point.
  • Figure 4: Quantitative comparisons of 3D Cross-Scene Accuracy (CSA) across different methods: CLIP2Scene, Gaussian Grouping, and OVGaussian. The figure highlights our enhanced segmentation accuracy and consistency, especially in handling complex scene details.
  • Figure 5: Visualization of cross-view consistency in novel viewpoints. It illustrates OVGaussian’s ability to maintain coherent segmentation across diverse viewpoints, showcasing cross-view consistency in 3D scene understanding.
  • ...and 11 more figures