Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting

Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini

TL;DR

VALA tackles open-vocabulary segmentation in 3D Gaussian Splatting by addressing two key issues: per-ray misassignment of 2D language features, and multi-view semantic drift caused by occlusion and view-dependent cues. The method combines a visibility-aware gating mechanism, based on per-ray contributions $w_i(r)=\alpha_i(r)T_i(r)$, with a streaming cosine-space geometric median that fuses multi-view features into a coherent 3D language embedding. It achieves state-of-the-art results on LeRF-OVS and ScanNet-v2 while remaining training-free and memory-efficient, with a runtime on the order of 10 seconds to 1 minute per scene on an RTX 4090. The approach improves both 2D and 3D open-vocabulary grounding, is robust to occlusions and noisy masks, and offers practical benefits for open-world 3D scene understanding and robotics.

Abstract

Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works. More results are available at https://vala3d.github.io
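To make the idea of a weighted geometric median in cosine space more concrete, the sketch below compares it with a plain weighted mean on unit-normalized features. It is only an illustration under stated assumptions: it uses a batch Weiszfeld-style fixed-point iteration on the unit sphere rather than the streaming update the abstract refers to, whose details are not given in this excerpt. The function names, the iteration count `iters`, and the guard `eps` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def weighted_mean_direction(features, weights):
    """Normalized weighted sum of the features (not robust to outliers)."""
    dim = len(features[0])
    total = [sum(w * f[k] for f, w in zip(features, weights)) for k in range(dim)]
    return normalize(total)

def spherical_geometric_median(features, weights, iters=50, eps=1e-8):
    """Batch Weiszfeld-style iteration on the unit sphere (assumed, non-streaming).

    For unit vectors, the chordal distance is sqrt(2 * (1 - cosine similarity)),
    so down-weighting by it penalizes features that disagree with the estimate.
    """
    feats = [normalize(f) for f in features]
    m = weighted_mean_direction(feats, weights)
    for _ in range(iters):
        acc = [0.0] * len(m)
        for f, w in zip(feats, weights):
            cos_sim = sum(a * b for a, b in zip(f, m))
            dist = math.sqrt(max(2.0 * (1.0 - cos_sim), 0.0))
            scale = w / max(dist, eps)  # eps avoids division by zero
            for k in range(len(m)):
                acc[k] += scale * f[k]
        m = normalize(acc)
    return m

# Two views agree on a direction close to [1, 0]; a third view is an outlier.
features = [[1.0, 0.1], [1.0, -0.1], [0.0, 1.0]]
weights = [1.0, 1.0, 1.0]
mean = weighted_mean_direction([normalize(f) for f in features], weights)
median = spherical_geometric_median(features, weights)
```

In this toy case the median stays close to the two agreeing views, while the mean is pulled noticeably toward the outlier. That is the kind of robustness to view-specific noise the abstract describes.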

Paper Structure

This paper contains 17 sections, 21 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Thanks to its visibility-aware and multi-view-consistent feature aggregation, our proposed VALA is the most accurate method and is as quick to optimize as the fastest one, cheng2024occam. Comparison in 3D open-vocabulary segmentation on the LeRF-OVS dataset qin2024langsplat.
  • Figure 2: Overview of VALA. The framework is shown on the left, with the orange and green blocks detailed on the right being our key contributions: the visibility-aware feature lifting (orange, Section \ref{sec:gating}), and the robust multi-view aggregation (green, Section \ref{sec:median}).
  • Figure 3: Visibility-aware gating for semantic assignment (Section \ref{sec:gating}). Simplified representation of a scene with two objects (a) $O_1,O_2$ and a camera ray $r$ with Gaussians $g_1,g_2$. We compute the opacity (b) and accumulate the transmittance front to back (c). We then calculate the contribution weights for each ray and threshold them with $\tau$ (d). Instead of propagating the features to all Gaussians as prior works do, our gating only propagates to the visible ones (e).
  • Figure 4: Qualitative 3D object selections on LeRF-OVS qin2024langsplat. We mark as failed those with low or zero IoU with the ground truth (red).
  • Figure 5: Qualitative results of 3D semantic segmentation with 19 classes on the ScanNet-v2 dataset dai2017scannet.
  • ...and 3 more figures
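
The gating steps in the Figure 3 caption can be sketched for a single ray as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard front-to-back compositing form of the transmittance, $T_i(r)=\prod_{j<i}(1-\alpha_j(r))$, and a simple fixed threshold $w_i(r)\ge\tau$ as the gate; the exact gating rule and the value of $\tau$ are not stated in this excerpt. The function names and the example values are hypothetical.

```python
def contribution_weights(alphas):
    """Front-to-back weights w_i = alpha_i * T_i, with T_i = prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0  # nothing occludes the first Gaussian on the ray
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha  # light remaining after this Gaussian
    return weights

def visible_indices(alphas, tau):
    """Indices of Gaussians whose contribution on this ray reaches the threshold tau."""
    return [i for i, w in enumerate(contribution_weights(alphas)) if w >= tau]

# Two Gaussians on one ray: a nearly opaque foreground one, then a background one.
alphas = [0.9, 0.8]
weights = contribution_weights(alphas)   # approximately [0.9, 0.08]
kept = visible_indices(alphas, tau=0.5)  # [0]: only the foreground Gaussian passes
```

Although the background Gaussian has a high opacity of its own, almost no light reaches it behind the foreground one. Its weight therefore falls below the assumed threshold, and it would not receive the pixel's language feature.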