Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi

TL;DR

Feature4X presents a scalable framework to convert monocular video into interactive 4D scenes by distilling and unifying 2D vision foundation model features into a compact 4D Gaussian feature field. The approach builds on dynamic 3D Gaussian Splatting with a 4D Motion Scaffold, introducing a unified latent feature representation and scaffold-based features that support 2D segmentation, 3D editing, and 4D VQA through lightweight decoders. An LLM-driven agentic AI loop enables language-guided editing, reasoning, and free-form VQA within 4D space, leveraging SAM2, CLIP-LSeg, and InternVideo features to bridge language and vision. Empirical results show competitive appearance reconstruction, robust 4D segmentation, efficient training/inference, and strong 4D reasoning capabilities, highlighting substantial potential for immersive, context-aware 4D agentic AI from casual monocular video.

Abstract

Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

Paper Structure

This paper contains 37 sections, 6 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Method overview. Given an input monocular video, we infer 2D priors to segment static background (represented by static 3D Gaussians augmented with latent features) and dynamic foreground (represented by dynamic 3D Gaussians guided by Motion Scaffolds [lei2024mosca], a set of nodes $\{\mathbf{v}_{i}\}$ encoding 3D motion trajectories and latent features $h_i$). Dynamic Gaussian features and motions are computed via interpolation from their $K$-nearest scaffold nodes. At each timestep, dynamic Gaussians are warped and fused with static Gaussians. A parallel rasterization [zhou2024feature] generates RGB images and a unified latent feature map, decoded into task-specific features, illustrated here by SAM2 [ravi2024sam], CLIP-LSeg [li2022language], and InternVideo2 [wang2024internvideo2] for representative 2D (novel view segmentation), 3D (scene editing), and 4D (spatiotemporal VQA) tasks. Our framework generalizes to any 2D vision foundation model and is trained end-to-end using input RGB frames and customized features from pretrained 2D models. At inference, rendered feature maps from arbitrary views and timesteps are directly fed into task-specific decoders, seamlessly supporting user prompts and LLM interactions to form a unified 4D agentic AI system.
  • Figure 2: Segment Anything in Dynamic 4D Scenes with SAM2 Feature Field. For any rendered novel view video, we support: (a) Promptless segmentation (segment everything): when no user prompt is provided, segmentation masks are automatically assigned at the first frame ($t = 0$) and then propagated across all frames. (b) Promptable segmentation (segment anything): the user can segment any object—static or dynamic—at any timestep using a point or box prompt, and the corresponding mask is robustly tracked and propagated through subsequent frames.
  • Figure 3: Baseline Comparison on SAM2 Inference. We compare segmentation quality and inference speed between (a) the naive RGB-based approach and (b) our feature-based method. Ours achieves comparable segmentation quality, accurately tracks the object over time, and avoids RGB artifacts (red box region at $t=70$), while running inference about $4\times$ faster.
  • Figure 4: Semantic 4D Scene Understanding with CLIP Feature Field. By lifting CLIP-LSeg [li2022language] features into a 4D feature field, we enable pixel-level semantic segmentation from any view at any timestep. This allows robust 4D scene understanding, even as object appearances change over time; for example, a blooming flower is accurately identified from bud to full bloom across views.
  • Figure 5: Scene Editing with AI Agent. Given user prompts, our GPT-powered agent interprets editing intent and autonomously performs scene edits via our 4D CLIP feature field. Examples include both geometric (e.g., "extract" and "delete") and appearance (e.g., "change color") editing in 3D space. While results may not be perfect due to imperfect fine-grained feature alignment and non-optimal editing parameter tuning, the agent adaptively refines parameters and applies edits consistently across views and time—greatly reducing the need for manual tuning—and demonstrates robust, interactive 4D scene manipulation.
  • ...and 11 more figures
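
The Figure 1 caption states that each dynamic Gaussian obtains its feature by interpolating from its $K$-nearest scaffold nodes. A minimal sketch of that idea is given below, assuming normalized inverse-distance weights; the weighting kernel, the function name `knn_interpolate`, and the toy node data are all hypothetical and are not taken from the paper.

```python
import math


def knn_interpolate(point, node_positions, node_features, k=3, eps=1e-8):
    """Blend the features of the k scaffold nodes nearest to `point`.

    Assumption: weights are normalized inverse distances. The paper may use
    a different kernel; this only illustrates K-nearest-node interpolation.
    """
    # Euclidean distance from the Gaussian center to every scaffold node.
    dists = [math.dist(point, p) for p in node_positions]
    # Indices of the k closest nodes.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Inverse-distance weights, normalized so they sum to 1.
    raw = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted sum of the selected node features, one channel at a time.
    dim = len(node_features[0])
    return [
        sum(w * node_features[i][d] for w, i in zip(weights, nearest))
        for d in range(dim)
    ]


# Toy scaffold: four nodes with 2-channel latent features (made-up values).
nodes = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (10.0, 10.0, 10.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [5.0, 5.0]]

# A Gaussian centered midway between the first two nodes, blended over K=2.
feature = knn_interpolate((1.0, 0.0, 0.0), nodes, feats, k=2)
print(feature)
```

Because the query point is equidistant from the first two nodes, both receive equal weight and the result is the average of their features. The same weights could in principle be reused to blend per-node motion, although blending rigid transforms needs more care than a plain weighted sum.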