4-LEGS: 4D Language Embedded Gaussian Splatting

Gal Fiebelman, Tamir Cohen, Ayellet Morgenstern, Peter Hedman, Hadar Averbuch-Elor

TL;DR

This paper introduces 4D Language Embedded Gaussian Splatting (4-LEGS), a framework that grounds natural language queries in dynamic 3D scenes by attaching a 4D language field to a dynamic Gaussian Splatting representation. It leverages ViCLIP to extract pixel-aligned spatio-temporal features, distills them into a scene-specific latent space via an autoencoder, and attends to local neighborhoods to produce coherent 4D language-grounded maps. Open-vocabulary querying is performed directly in 4D, yielding temporal localization and pixel-level grounding with efficient inference on a single GPU. The authors also construct the Grounding-PanopticSports benchmark and demonstrate significant improvements over 2D baselines and static 3D language grounding, enabling practical text-driven spatio-temporal video editing and highlighting across multiple scenes. This work paves the way for interactive, language-guided manipulation and analysis of dynamic volumetric scenes in AR/VR and volumetric VQA contexts.

Abstract

The emergence of neural representations has revolutionized our means for digitally viewing a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel views. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantics understanding embodied within the scene. These methods elevate the rich semantic understanding from 2D imagery to 3D representations, distilling high-dimensional spatial features onto 3D space. In our work, we are interested in connecting language with a dynamic modeling of the world. We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
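To make the idea of localizing an event in space and time from a text prompt concrete, the following is a minimal sketch and not the authors' implementation. It assumes each Gaussian already carries a language feature at every timestep and that the text prompt has been embedded into the same space. The names `cosine` and `localize`, the threshold value, and the plain cosine-similarity test are illustrative assumptions; the relevancy score actually used by 4-LEGS is not described in this section.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def localize(features, query, threshold=0.5):
    """Hypothetical helper: features[t][g] is the feature of Gaussian g at timestep t.

    Returns a list of (timestep, matching Gaussian indices) pairs, keeping only
    timesteps where at least one Gaussian is similar enough to the query.
    """
    hits = []
    for t, frame in enumerate(features):
        matched = [g for g, f in enumerate(frame) if cosine(f, query) >= threshold]
        if matched:
            hits.append((t, matched))
    return hits

# Toy data: 3 timesteps, 2 Gaussians, 2-dimensional features (all values made up).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 0.0]],
]
query = [1.0, 0.0]
result = localize(features, query, threshold=0.9)
print(result)
```

The timesteps that appear in the result give the temporal segment, and the Gaussian indices give the spatial region that would be rendered as a highlighted map from any viewpoint.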

Paper Structure

This paper contains 26 sections, 13 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: 4D Language Field Optimization. Given multiple videos capturing a dynamic 3D scene, we first extract pixel-aligned spatio-temporal language features at multiple scales using a video-text model. We average these features to produce spatio-temporal features, which are encoded into a more compact latent space that is used for supervising the optimization of a 4D language embedded Gaussian.
  • Figure 2: Comparison to 2D spatio-temporal grounding. We show spatio-temporal localization results for TubeDETR [yang2022tubedetr] and CGSTVG [gu2024context], two 2D baseline methods, along with our results. The textual queries and ground-truth segmentation maps are taken from our Grounding-PanopticSports benchmark. Results are illustrated over three different camera viewpoints (shown on different rows) and four different timestamps (shown on different columns). As illustrated above, these 2D methods cannot generate view-consistent results and often fail to correctly localize the queried text. Our approach, on the other hand, allows for localizing the regions in both space and time.
  • Figure 3: Spatio-temporal localization given a set of 4-LEGS, representing multiple dynamic 3D environments. We query all six scenes from the Panoptic Sports dataset, each illustrated by a single frame above, over different queries shown in unique colors. As illustrated by the matching colors, our approach allows for retrieving the correct dynamic environment.
  • Figure 4: Qualitative ablation results for the input query "A person holding the ball". We ablate the use of a volumetric representation (2D features) and the use of the ViCLIP video encoder ($\text{Static}_{\text{CLIP}}$, $\text{AVG}_{\text{CLIP}}$). As illustrated above, our approach outperforms these ablations both spatially and temporally, yielding more accurate results because our extracted features better capture temporal differences and because we learn 3D-consistent representations.
  • Figure 5: We show spatio-temporal localization results for LangSplat (LS), which embeds language features onto representations depicting static 3D environments, along with our results, over three different timesteps (shown on different columns) and two different camera viewpoints (shown on different rows). LS outputs high probabilities for all timesteps, showing difficulty in capturing the temporal segment of the action. As demonstrated above, methods targeting static 3D environments are not intended to operate over such dynamic settings.
  • ...and 9 more figures
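
The feature preparation described in the Figure 1 caption (averaging features extracted at multiple scales, then encoding them into a more compact latent space) can be sketched as follows. This is only an illustration under stated assumptions: `average_scales` and `encode` are hypothetical helpers, the fixed linear projection stands in for the encoder half of the scene-specific autoencoder, and the weights and dimensions are invented toy values rather than anything reported in the paper.

```python
def average_scales(per_scale):
    """Average the feature vectors extracted for one pixel at several scales."""
    count = len(per_scale)
    dim = len(per_scale[0])
    return [sum(vec[d] for vec in per_scale) / count for d in range(dim)]

def encode(feature, weights):
    """Stand-in linear encoder: weights has shape latent_dim x feature_dim.

    A trained autoencoder would learn this mapping (and a decoder back to the
    original feature space); here the weights are fixed toy values.
    """
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

# Toy example: two scales, 4-dimensional features, 2-dimensional latent space.
per_scale = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
]
avg = average_scales(per_scale)

weights = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
latent = encode(avg, weights)
print(avg, latent)
```

Supervising the per-Gaussian features with compact latent codes of this kind, rather than the full-dimensional features, keeps the memory footprint of the 4D language field manageable.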