CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting

Wei Sun, Yanzhao Zhou, Jianbin Jiao, Yuan Li

TL;DR

This work addresses cross-view granularity inconsistency in open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3DGS) by introducing Context-Aware Gaussian Splatting (CAGS). CAGS combines (i) Contextual Feature Propagation, which aggregates spatial context over local graphs of Gaussians, (ii) Mask-Aware Contrastive Learning on mask centroids, which smooths SAM-derived features across views, and (iii) a Precomputation strategy that freezes Gaussian positions and stores neighborhood relations so that training scales to large scenes. It further integrates Instance Clustering and Semantic Matching to align 3D instances with 2D open-vocabulary priors. The authors report significant improvements in 3D instance segmentation and reduced fragmentation on LERF-OVS and ScanNet. The approach targets robust language-guided 3D understanding in large-scale scenes, with potential impact on robotics and augmented reality applications that require reliable multi-view semantic grounding.

Abstract

Open-vocabulary 3D scene understanding is crucial for applications requiring natural language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue stems from 2D segmentation methods like SAM, which segment the same object differently across views (e.g., a "coffee set" segmented as a single entity in one view but as "cup + coffee + spoon" in another). Existing 3DGS-based methods often learn features for each Gaussian in isolation, neglecting the spatial context needed for cohesive object reasoning, which leads to fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS (1) constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity; (2) employs mask-centric contrastive learning to smooth SAM-derived features across views; and (3) precomputes neighborhood relationships to reduce computational cost, enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
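
To make the local-graph and precomputation ideas concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how neighborhoods are built or how features are combined, so the brute-force k-nearest-neighbor search, the mean aggregation, and the names `precompute_knn`, `propagate`, and `alpha` are all hypothetical choices.

```python
import numpy as np


def precompute_knn(positions: np.ndarray, k: int) -> np.ndarray:
    """Return an (N, k) array with each point's k nearest neighbours.

    Assumption: positions are frozen, so this runs once before training.
    Brute force O(N^2) is used for clarity; a spatial index would be
    needed for a real scene.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)  # squared distances
    np.fill_diagonal(dist2, np.inf)               # exclude the point itself
    return np.argsort(dist2, axis=1)[:, :k]


def propagate(features: np.ndarray, neighbors: np.ndarray,
              alpha: float = 0.5) -> np.ndarray:
    """Blend each feature with the mean of its neighbours' features.

    Assumption: mean aggregation with mixing weight `alpha` stands in
    for whatever aggregation the paper actually uses.
    """
    neighbor_mean = features[neighbors].mean(axis=1)  # (N, D)
    return alpha * features + (1.0 - alpha) * neighbor_mean


# Toy scene: two well-separated clusters of three "Gaussians" each.
positions = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0], [5.0, 5.1, 5.0],
])
# Noisy per-point features: cluster A near [1, 0], cluster B near [0, 1].
features = np.array([
    [1.0, 0.0], [0.8, 0.2], [0.9, 0.1],
    [0.0, 1.0], [0.2, 0.8], [0.1, 0.9],
])

neighbors = precompute_knn(positions, k=2)  # computed once, then reused
smoothed = propagate(features, neighbors)
```

Because the neighbor indices depend only on the frozen positions, they can be stored and reused at every training step; in this toy case each point's neighbors fall inside its own cluster, so propagation reduces the within-cluster feature spread without mixing the two objects.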

Paper Structure

This paper contains 16 sections, 13 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of cross-view granularity inconsistency in 3DGS-based open-vocabulary scene understanding. A "plate of cookies" may be segmented as a single entity in one view but split into "cookie + cookie + cookie" in another, and a "coffee set" may appear unified in one view but fragmented into "cup + coffee + spoon" in another.
  • Figure 2: Overview of the Context-Aware Gaussian Splatting (CAGS) pipeline for open-vocabulary 3D scene understanding, starting with optimized 3D Gaussians. The pipeline includes: (1) Local Graph Sampling, (2) Neighborhood Feature Aggregation, (3) Global Feature Propagation (GFP), (4) Mask-Aware Contrastive Learning, and (5) Instance Clustering and Semantic Matching.
  • Figure 3: Text query visualization on the LERF-OVS dataset. Columns compare reference images with segmentations from OpenGaussian, Dr.Splat, and our CAGS method for queries like "egg" and "rubik's cube." CAGS achieves more accurate target identification with fewer noisy fragments, avoiding nearby objects and background noise.
  • Figure 4: Comparison of feature visualizations on the ScanNet dataset. Rows represent different scenes, highlighting the ability of each method to capture semantic features for open-vocabulary 3D scene understanding.
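
Step (4) of the pipeline in Figure 2, Mask-Aware Contrastive Learning, operates on mask centroids. The sketch below shows one plausible form of such a loss, assuming an InfoNCE-style objective over cosine similarities: each feature is pulled toward the centroid of its own mask and pushed away from the centroids of other masks. The exact loss used by CAGS is not given in this summary, so the formulation, the `temperature` value, and the function names `mask_centroids` and `centroid_contrastive_loss` are assumptions for illustration only.

```python
import numpy as np


def mask_centroids(features: np.ndarray, mask_ids: np.ndarray):
    """Return the sorted unique mask ids and the mean feature per mask."""
    ids = np.unique(mask_ids)
    centroids = np.stack([features[mask_ids == m].mean(axis=0) for m in ids])
    return ids, centroids


def centroid_contrastive_loss(features: np.ndarray, mask_ids: np.ndarray,
                              temperature: float = 0.1) -> float:
    """Cross-entropy of each feature against all mask centroids.

    Assumption: cosine similarity with a softmax over centroids; the
    positive for each feature is the centroid of its own mask.
    """
    ids, centroids = mask_centroids(features, mask_ids)
    eps = 1e-8  # guards against division by zero for all-zero vectors
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)

    logits = f @ c.T / temperature                 # (P, M)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    targets = np.searchsorted(ids, mask_ids)       # mask id -> centroid row
    return float(-log_prob[np.arange(len(features)), targets].mean())


mask_ids = np.array([0, 0, 1, 1])

# Features that agree with the masks: low loss.
consistent = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_consistent = centroid_contrastive_loss(consistent, mask_ids)

# Features that ignore the masks: both centroids coincide, so the loss
# equals log(2), the value for a uniform guess between two masks.
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_mixed = centroid_contrastive_loss(mixed, mask_ids)
```

Comparing against one centroid per mask, rather than against every other pixel, keeps the cost linear in the number of features and dampens the influence of individual noisy mask boundaries, which is in keeping with the smoothing role the caption ascribes to this step.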