Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini
TL;DR
VALA tackles open-vocabulary segmentation in 3D Gaussian Splatting by addressing two key issues: per-ray misassignment of 2D language features, and multi-view semantic drift caused by occlusion and view-dependent cues. The method combines a visibility-aware gating mechanism, based on the per-ray contributions $w_i(r)=\alpha_i(r)T_i(r)$ (the opacity of Gaussian $i$ along ray $r$ times the transmittance accumulated in front of it), with a streaming cosine-space geometric median that fuses multi-view features into a coherent 3D language embedding. It achieves state-of-the-art results on LERF-OVS and ScanNet-v2 while remaining training-free and memory-efficient, with a runtime of roughly 10 seconds to 1 minute per scene on an RTX 4090. The approach improves both 2D and 3D open-vocabulary grounding, is robust to occlusions and noisy masks, and offers practical benefits for open-world 3D scene understanding and robotics.
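To make the per-ray contribution concrete, the following is a minimal sketch of how $w_i(r)=\alpha_i(r)T_i(r)$ can be computed for Gaussians sorted front to back along one ray, using the standard alpha-compositing transmittance $T_i=\prod_{j<i}(1-\alpha_j)$. The gate shown here is an assumption made for illustration: the function name `visibility_gate`, the relative-share rule, and the `min_share` threshold are hypothetical and are not taken from the method itself.

```python
def ray_contributions(alphas):
    """Return w_i = alpha_i * T_i for Gaussians sorted front to back on one ray."""
    weights = []
    transmittance = 1.0  # T_1 = 1: nothing lies in front of the first Gaussian
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha  # T_{i+1} = T_i * (1 - alpha_i)
    return weights


def visibility_gate(weights, min_share=0.2):
    """Hypothetical gate: keep a Gaussian only if it carries at least
    `min_share` of the ray's total contribution (assumed rule)."""
    total = sum(weights)
    if total == 0.0:
        return [False] * len(weights)
    return [w / total >= min_share for w in weights]


# Four Gaussians hit by one ray: a faint one in front, a dominant one,
# and two that are mostly occluded behind it.
alphas = [0.05, 0.8, 0.5, 0.9]
weights = ray_contributions(alphas)
keep = visibility_gate(weights)
```

In this toy ray only the second Gaussian passes the gate, so only it would receive the pixel's language feature. The two Gaussians behind it have high opacity but little remaining transmittance, which is exactly the case where assigning the same feature to every Gaussian on the ray would be misleading.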
Abstract
Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions with 3D scenes, we observe two fundamental issues: background Gaussians that contribute negligibly to a rendered pixel receive the same feature as the dominant foreground ones, and view-specific noise in the language embeddings leads to multi-view inconsistencies. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works. More results are available at https://vala3d.github.io.
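The robustness argument behind a geometric median can be illustrated with a small sketch. The code below is not the streaming formulation described above: it is a batch Weiszfeld-style iteration over unit vectors, and it assumes the chordal distance $\sqrt{2(1-\cos\theta)}$ as the metric, which is a monotone function of cosine distance. The function names, the iteration count, and the `eps` safeguard are all illustrative choices.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_geometric_median(features, weights, iters=50, eps=1e-8):
    """Weighted geometric median of unit vectors (batch Weiszfeld sketch).

    Assumed metric: chordal distance sqrt(2 * (1 - cos)) on the unit sphere.
    """
    feats = [normalize(f) for f in features]
    dim = len(feats[0])
    # Initialize with the normalized weighted mean.
    median = normalize(
        [sum(w * f[d] for f, w in zip(feats, weights)) for d in range(dim)]
    )
    for _ in range(iters):
        acc = [0.0] * dim
        for f, w in zip(feats, weights):
            cos = sum(a * b for a, b in zip(median, f))
            chord = math.sqrt(max(2.0 * (1.0 - cos), 0.0))
            scale = w / max(chord, eps)  # far-away views are down-weighted
            for d in range(dim):
                acc[d] += scale * f[d]
        median = normalize(acc)  # project back onto the unit sphere
    return median


# Four consistent views plus one outlier view (e.g. an occluded observation).
views = [[1.0, 0.0], [1.0, 0.05], [1.0, -0.05], [0.99, 0.02], [0.0, 1.0]]
view_weights = [1.0] * len(views)
median = cosine_geometric_median(views, view_weights)
mean = normalize([sum(normalize(v)[d] for v in views) for d in range(2)])
```

In this toy example the normalized mean is pulled noticeably toward the outlier, while the median stays within the cluster of consistent views. A streaming variant would have to update the estimate as views arrive instead of storing all features, which this batch sketch does not attempt.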
