GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool

TL;DR

GaussianVLM introduces a detector-free, scene-centric 3D Vision-Language Model that operates on expressive Gaussian splats and directly embeds language features into the scene. A language-aligned SceneSplat backbone paired with a dual sparsifier compresses dense language-augmented 3D representations into a compact token set fed to a frozen LLM, enabling robust embodied reasoning. The approach achieves state-of-the-art results on scene-centric benchmarks and generalizes well to RGB-derived 3D data, aided by a new object-counting OOD dataset. By removing object detectors and emphasizing global scene context, GaussianVLM advances open-ended spatial reasoning in 3D vision-language tasks while maintaining efficiency through targeted sparsification.

Abstract

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images and demonstrating strong generalization: it improves the performance of prior 3D VLMs fivefold in out-of-domain settings.

Paper Structure

This paper contains 27 sections, 3 equations, 13 figures, and 13 tables.

Figures (13)

  • Figure 1: The proposed GaussianVLM performs comprehensive scene understanding in natural language for 3D scenes represented as Gaussian Splats. It adopts a fully scene-centric approach, building a global, language-augmented scene representation. This enables effective handling of both scene- and object-level tasks -- requiring multi-object reasoning, spatial understanding, global context, and fine-grained analysis -- suitable for embodied reasoning and beyond.
  • Figure 2: The GaussianVLM architecture processes a user task prompt (query and optional location) and a 3D scene (Gaussian Splat representation). A 3D vision module (SceneSplat Transformer) predicts per-Gaussian language features. These dense features are then sparsified by a dual sparsifier module. The decoder's hidden states also inform the task-guided sparsifier. The dual sparsifier comprises: 1) a location-guided pathway that selects language features from Gaussians within a ROI around the task location, producing ROI tokens; and 2) a task-guided pathway that attends to dense scene tokens and SceneSplat decoder hidden states using task tokens (via cross-attention) to produce 128 task-selected scene tokens. The resulting sparse scene representation (ROI tokens + task-selected tokens), along with the task tokens, is input to an LLM for response generation.
  • Figure 3: Qualitative results on scene-centric tasks.
  • Figure 4: Qualitative results on object-centric tasks.
  • Figure 5: Distribution, by object class label, of the object-counting questions answered correctly by GaussianVLM. Overall, 254 questions were answered correctly.
  • ...and 8 more figures
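
The dual sparsifier described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-Gaussian language features are already available, uses a spherical ROI of hypothetical radius around the task location, and replaces the learned task-guided pathway with a single-head, projection-free cross-attention. The function names (`select_roi_tokens`, `cross_attention`, `dual_sparsify`) and the way the 128 query vectors are obtained are assumptions for illustration only, and the SceneSplat decoder hidden states are omitted for brevity.

```python
import math
import random


def select_roi_tokens(positions, features, location, radius):
    """Location-guided pathway (sketch): keep the language features of
    Gaussians whose centers lie within `radius` of the task location."""
    return [feat for pos, feat in zip(positions, features)
            if math.dist(pos, location) <= radius]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned
    projections (an assumption; a trained model would learn them)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs


def dual_sparsify(positions, features, task_queries, location=None, radius=1.0):
    """Return (ROI tokens, task-selected tokens). The location is optional,
    mirroring the optional location in the task prompt."""
    roi = [] if location is None else select_roi_tokens(
        positions, features, location, radius)
    # One output token per query vector; with 128 queries this yields
    # 128 task-selected scene tokens.
    task_selected = cross_attention(task_queries, features, features)
    return roi, task_selected


# Toy data standing in for a scene: 200 Gaussians with 8-dim features.
random.seed(0)
dim = 8
positions = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(200)]
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
# Hypothetical: 128 query vectors assumed to be derived from the task tokens.
task_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(128)]

roi, selected = dual_sparsify(positions, features, task_queries,
                              location=[0.0, 0.0, 0.0], radius=2.0)
print(len(roi), "ROI tokens;", len(selected), "task-selected tokens")
```

In this sketch the number of ROI tokens varies with the scene and the radius, while the task-guided pathway always returns one token per query, which is what keeps the input to the LLM compact regardless of how many Gaussians the scene contains.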