CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting
Wei Sun, Yanzhao Zhou, Jianbin Jiao, Yuan Li
TL;DR
This work addresses cross-view granularity inconsistency in open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3DGS) by introducing Context-Aware Gaussian Splatting (CAGS). CAGS combines (i) Contextual Feature Propagation, which aggregates spatial context over local graphs, (ii) Mask-Aware Contrastive Learning on mask centroids, which smooths SAM-derived features across views, and (iii) a Precomputation strategy that freezes Gaussian positions and stores neighborhood relations for scalable training. It further integrates Instance Clustering and Semantic Matching to align 3D instances with 2D open-vocabulary priors, yielding significant improvements in 3D instance segmentation and reduced fragmentation on LERF-OVS and ScanNet. The approach enables robust language-guided 3D understanding in large-scale scenes, with potential impact on robotics and augmented reality applications that require reliable multi-view semantic grounding.
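To make the mask-centroid idea concrete, here is a minimal NumPy sketch of one plausible form of such a loss. It is an illustration only, not the authors' formulation: the function name `mask_centroid_contrastive_loss`, the InfoNCE-style objective, and the `temperature` value are all assumptions. Each feature is pulled toward the centroid of its own mask and pushed away from the centroids of the other masks.

```python
import numpy as np

def mask_centroid_contrastive_loss(features, mask_ids, temperature=0.1):
    """InfoNCE-style loss against per-mask centroids (hypothetical sketch)."""
    features = np.asarray(features, dtype=float)
    mask_ids = np.asarray(mask_ids).ravel()
    # Work with unit-length features so dot products are cosine similarities.
    features = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    # Map arbitrary mask ids to 0..M-1.
    labels, inverse = np.unique(mask_ids, return_inverse=True)
    # One centroid per mask: the mean of the features assigned to it.
    centroids = np.stack(
        [features[inverse == m].mean(axis=0) for m in range(len(labels))]
    )
    centroids = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8)
    # Similarity of every feature to every centroid, scaled by temperature.
    logits = features @ centroids.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of each feature's own mask centroid.
    return float(-log_prob[np.arange(len(features)), inverse].mean())

# Features that agree with their masks give a lower loss than mixed ones.
clean = mask_centroid_contrastive_loss([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
mixed = mask_centroid_contrastive_loss([[1, 0], [0, 1], [1, 0], [0, 1]], [0, 0, 1, 1])
```

In this toy setup the loss is small when the features inside each mask agree, and it rises when a mask mixes dissimilar features, which is the behavior a centroid-based objective would be expected to penalize.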
Abstract
Open-vocabulary 3D scene understanding is crucial for applications requiring natural language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue stems from 2D segmentation methods such as SAM, which segment the same object differently across views (e.g., a "coffee set" segmented as a single entity in one view but as "cup + coffee + spoon" in another). Existing 3DGS-based methods often learn features for each Gaussian in isolation, neglecting the spatial context needed for cohesive object reasoning and producing fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity. It employs mask-centric contrastive learning to smooth SAM-derived features across views. It also precomputes neighborhood relationships to reduce computational cost, enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets such as LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
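The combination of precomputed neighborhoods and feature propagation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the helper names `precompute_knn` and `propagate`, the brute-force k-nearest-neighbor search, the mean aggregation, and the blending weight `alpha` are all hypothetical choices. The point is that once Gaussian positions are fixed, the neighbor indices can be computed once and reused at every training step.

```python
import numpy as np

def precompute_knn(positions, k):
    """Return an (N, k) array of neighbor indices for fixed positions.

    Brute-force O(N^2) search, fine for a toy example; a spatial index
    would be needed for a large scene.
    """
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]

def propagate(features, neighbors, alpha=0.5):
    """Blend each feature with the mean of its neighbors, then L2-normalize."""
    features = np.asarray(features, dtype=float)
    context = features[neighbors].mean(axis=1)  # (N, D) neighborhood average
    mixed = (1.0 - alpha) * features + alpha * context
    return mixed / (np.linalg.norm(mixed, axis=1, keepdims=True) + 1e-8)

# Two spatial clusters; point 2 carries a feature that disagrees with its cluster.
positions = [[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0],
             [5, 5, 5], [5.1, 5, 5], [5, 5.1, 5]]
features = np.array([[1, 0], [1, 0], [0, 1],
                     [0, 1], [0, 1], [0, 1]], dtype=float)

neighbors = precompute_knn(positions, k=2)  # computed once, reused every step
smoothed = propagate(features, neighbors)
```

After one propagation step, the outlier at index 2 moves toward the features of its spatial neighbors, which is the kind of smoothing that reduces fragmentation within an object.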
