4-LEGS: 4D Language Embedded Gaussian Splatting

Gal Fiebelman; Tamir Cohen; Ayellet Morgenstern; Peter Hedman; Hadar Averbuch-Elor

TL;DR

This paper introduces 4D Language Embedded Gaussian Splatting (4-LEGS), a framework that grounds natural-language queries in dynamic 3D scenes by attaching a 4D language field to a dynamic Gaussian Splatting representation. It uses ViCLIP to extract pixel-aligned spatio-temporal features, distills them into a scene-specific latent space with an autoencoder, and attends to local neighborhoods to produce coherent 4D language-grounded maps. Open-vocabulary queries are evaluated directly in 4D, yielding both temporal localization and pixel-level grounding, with efficient inference on a single GPU. The authors also construct the Grounding-PanopticSports benchmark and demonstrate significant improvements over 2D baselines and static 3D language-grounding methods. Demonstrated applications include text-driven spatio-temporal video editing and highlighting across multiple scenes, with potential use in interactive, language-guided manipulation and analysis of dynamic volumetric scenes in AR/VR and volumetric VQA settings.

Abstract

The emergence of neural representations has revolutionized our means for digitally viewing a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel views. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantic understanding embodied within the scene. These methods elevate the rich semantic understanding from 2D imagery to 3D representations, distilling high-dimensional spatial features onto 3D space. In our work, we are interested in connecting language with a dynamic model of the world. We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatio-temporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
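
To make the idea of spatio-temporal localization from a text prompt concrete, the following is a minimal sketch and not the authors' implementation. It assumes every Gaussian already carries a language feature at every timestep and that the text prompt has been embedded into the same space; the names `cosine`, `localize`, and `threshold` are hypothetical, and plain cosine similarity with a fixed cutoff stands in for whatever relevancy score the method actually uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def localize(text_emb, feats_per_time, threshold=0.5):
    """Toy spatio-temporal query (illustrative assumption, not the paper's scoring).

    feats_per_time[t][g] is the language feature of Gaussian g at timestep t.
    Returns (segments, masks):
      masks[t][g] is True where Gaussian g matches the prompt at time t (spatial part);
      segments is a list of inclusive (start, end) timestep ranges in which
      at least one Gaussian matches (temporal part).
    """
    masks = [
        [cosine(text_emb, f) >= threshold for f in feats]
        for feats in feats_per_time
    ]
    active = [any(m) for m in masks]

    segments = []
    start = None
    for t, on in enumerate(active):
        if on and start is None:
            start = t                      # a matching run begins
        elif not on and start is not None:
            segments.append((start, t - 1))  # the run ended at the previous step
            start = None
    if start is not None:                  # run extends to the final timestep
        segments.append((start, len(active) - 1))
    return segments, masks


# Toy data: 4 timesteps, 2 Gaussians, 2-D features (real features are far larger).
text_emb = [1.0, 0.0]
feats = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
segments, masks = localize(text_emb, feats)
```

In this toy setup only the first Gaussian matches the prompt, and only at timesteps 1 and 2, so the query returns a single temporal segment together with a per-Gaussian mask for each timestep.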
