4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister

TL;DR

This work addresses the lack of effective open-vocabulary querying in dynamic 4D scenes by extending 3D Gaussian Splatting to 4D and decoupling semantic learning into two fields: a time-agnostic CLIP-based field and a time-varying field learned from MLLM-generated per-object captions. A multimodal object-wise video prompting pipeline converts videos into pixel-aligned, object-level captions, which are encoded into sentence embeddings that supervise the time-varying language field. To model temporal evolution, a status deformable network represents per-Gaussian semantics as a weighted combination of K state prototypes, promoting smooth transitions over time. This framework enables accurate time-sensitive and time-agnostic open-vocabulary queries with improved efficiency, validated on HyperNeRF and Neu3D datasets, and demonstrates the advantage of MLLM-driven supervision over traditional vision-based features. The approach yields state-of-the-art results in dynamic semantic grounding and offers practical implications for interactive 4D environments.
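
The weighted combination of K state prototypes can be illustrated with a minimal sketch. It assumes the mixing weights come from a softmax over per-state scores, which the summary above does not state; the function names, the `logits` input (a stand-in for whatever the deformable network predicts at time t), and the toy two-dimensional prototypes are all hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_states(prototypes, logits):
    """Semantic feature of one Gaussian at one time step.

    prototypes: K feature vectors, one per state (hypothetical values).
    logits:     K scores for the current time step (assumed to come
                from a network; here they are passed in directly).
    Returns the weighted combination sum_k w_k * prototype_k.
    """
    weights = softmax(logits)
    dim = len(prototypes[0])
    return [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(dim)
    ]


# Toy example: two states ("closed", "open") in a 2-D feature space.
closed, opened = [1.0, 0.0], [0.0, 1.0]
halfway = blend_states([closed, opened], [0.0, 0.0])
mostly_open = blend_states([closed, opened], [-4.0, 4.0])
```

Because the weights are constrained to sum to 1, the blended feature stays inside the span of the prototypes, and a gradual change in the scores over time yields a gradual change in the feature rather than an abrupt jump.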

Abstract

Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Visualization of the learned language features of our 4D LangSplat. We observe that 4D LangSplat effectively learns dynamic semantic features that change over time, such as the gradual diffusion of coffee shown in the first two rows, and the "chicken" toggling between open and closed states in the latter two rows. Additionally, our semantic field captures consistent features for semantics that remain unchanged over time, with the clear object boundaries in the visualization demonstrating the precision of our semantic field.
  • Figure 2: The framework of constructing a time-varying semantic field in 4D LangSplat. We first use multimodal object-wise prompting to convert a video into pixel-aligned object-level caption features. Then, we learn a 4D language field with a status deformable network.
  • Figure 3: Visualization of time-sensitive querying results for Deformable CLIP and ours. The bottom row depicts the cosine similarity across frames, rescaled to (0,1) for direct comparison, while the horizontal bars indicate the frames identified as relevant time segments. We observe that the CLIP-based method cannot understand dynamic semantics correctly, while our method recognizes them.
  • Figure 4: Comparison of time-sensitive query masks between Deformable CLIP and ours. The CLIP-based method fails to identify time segments accurately, especially at the demarcation points during state transitions.
  • Figure 5: Visualization of time-agnostic querying results on the HyperNeRF (Park et al., 2021) and Neu3D (Li et al., 2022) datasets.
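
The per-frame scoring described in the Figure 3 caption (cosine similarity across frames, rescaled for comparison, then grouped into relevant time segments) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: min-max rescaling, the 0.5 threshold, and all function names are hypothetical choices, and the toy features stand in for rendered language features and an encoded text query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rescale(values):
    """Min-max rescale to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # flat curve: nothing stands out
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def relevant_segments(frame_features, query, threshold=0.5):
    """Return inclusive (start, end) frame ranges whose rescaled
    similarity to the query is at or above the threshold."""
    scores = rescale([cosine(f, query) for f in frame_features])
    segments, start = [], None
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t  # a relevant segment begins
        elif score < threshold and start is not None:
            segments.append((start, t - 1))  # segment ended last frame
            start = None
    if start is not None:  # segment runs to the final frame
        segments.append((start, len(scores) - 1))
    return segments


# Toy video: the object matches the query only in frames 2 and 3.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
segments = relevant_segments(frames, query=[0.0, 1.0])  # [(2, 3)]
```

A fixed threshold on a rescaled curve is only one way to pick segments. Its behavior at state transitions depends on how sharply the underlying features change, which is where the captions for Figures 3 and 4 report the CLIP-based baseline struggling.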