FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents

Xin Tan, Yuzhou Ji, He Zhu, Yuan Xie

TL;DR

FMLGS tackles the challenge of part-level open-vocabulary localization in 3D radiance fields by introducing a multilevel, SAM2-assisted pipeline that extracts and semantically deviates object- and part-level features, maps them across views, and trains Gaussian-based features for fast, pixel-aligned querying. The method achieves state-of-the-art speed and accuracy on open-vocabulary localization and supports interactive agents capable of navigating scenes and responding to natural language prompts. Key innovations include semantic deviation to resolve language ambiguity, identity-based cross-view feature mapping, and a two-step multilevel localization strategy. These contributions enable practical applications in language-driven 3D segmentation and object inpainting, with potential impact on embodied AI and interactive scene understanding.

Abstract

The semantically interactive radiance field has long been a promising backbone for real-world 3D applications, such as embodied AI for scene understanding and manipulation. However, multi-granularity interaction remains challenging because of the ambiguity of language and the degraded quality of queries on object components. In this work, we present FMLGS, an approach that supports part-level open-vocabulary queries within 3D Gaussian Splatting (3DGS). We propose an efficient pipeline for building and querying consistent object- and part-level semantics based on Segment Anything Model 2 (SAM2). We design a semantic deviation strategy to resolve language ambiguity among object parts, which interpolates the semantic features of fine-grained targets for enriched information. Once trained, the model can be queried for both objects and their describable parts using natural language. Comparisons with other state-of-the-art methods show that our method not only locates specified part-level targets more accurately, but also achieves the best performance in both speed and accuracy: FMLGS is 98× faster than LERF, 4× faster than LangSplat, and 2.5× faster than LEGaussians. We further integrate FMLGS into a virtual agent that can interactively navigate through 3D scenes, locate targets, and respond to user demands through a chat interface, which demonstrates the potential of our work to be extended and applied in the future.
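The two-step multilevel localization mentioned in the TL;DR can be illustrated with a minimal sketch. This is one plausible reading, not the authors' implementation: it assumes that per-pixel object-level and part-level features have already been rendered, that text prompts are embedded in the same space, and that relevance is scored by cosine similarity against a fixed threshold. The names `cosine`, `two_step_query`, and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def two_step_query(
    object_feats: List[Vector],
    part_feats: List[Vector],
    object_text: Vector,
    part_text: Vector,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[bool]:
    """Return a per-pixel mask for the part, restricted to the object region."""
    # Step 1: object-level localization.
    object_mask = [cosine(f, object_text) > threshold for f in object_feats]
    # Step 2: part-level localization, evaluated only inside the object mask.
    return [
        inside and cosine(f, part_text) > threshold
        for inside, f in zip(object_mask, part_feats)
    ]


# Toy example: four pixels with 2-D features (hypothetical values).
object_feats = [[1, 0], [1, 0], [0, 1], [0, 1]]
part_feats = [[1, 0], [0, 1], [1, 0], [0, 1]]
mask = two_step_query(object_feats, part_feats, object_text=[1, 0], part_text=[0, 1])
print(mask)  # [False, True, False, False]
```

In this toy case the last pixel matches the part prompt but lies outside the object region, so it is rejected. That is the kind of ambiguity a coarse-to-fine query is meant to suppress.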

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Results of querying "A Button of Xbox Wireless Controller". While LangSplat [qin2023langsplat] fails in detailed part-level localization, our method provides an accurate outcome.
  • Figure 2: FMLGS pipeline. Left: Initialization for masks and features. Mid: Feature mapping of object-level embedding and deviated part-level embedding through consistent identities for training and restoring. Right: Query using open-vocabulary prompts, supporting agent integration.
  • Figure 3: Examples of environment-sensitive agents in four different cases.
  • Figure 4: Agent execution framework. Top: a two-stage nested loop in which the outer loop issues the main task after accepting user input, and the inner loop decomposes a single task into an executable task sequence. Bottom: function modules for the different subtasks, called by the inner loop during execution.
  • Figure 5: Illustration of dilating keypoints to generate more view anchors in the scene.
  • ...and 4 more figures
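
The nested-loop agent framework described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions: the module names `locate` and `navigate`, the `MODULES` registry, and the rule-based `decompose` planner are hypothetical stand-ins, since the caption does not say how tasks are decomposed or which function modules exist.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (module name, argument)


def locate(target: str) -> str:
    """Hypothetical function module: find the target in the scene."""
    return f"located {target}"


def navigate(target: str) -> str:
    """Hypothetical function module: move the view toward the target."""
    return f"navigated to {target}"


# Registry of function modules that the inner loop may call.
MODULES: Dict[str, Callable[[str], str]] = {
    "locate": locate,
    "navigate": navigate,
}


def decompose(task: str) -> List[Step]:
    """Hypothetical planner: a fixed rule standing in for the real one."""
    prefix = "find "
    target = task[len(prefix):] if task.startswith(prefix) else task
    return [("locate", target), ("navigate", target)]


def run_agent(user_inputs: List[str]) -> List[str]:
    """Outer loop issues one main task per input; inner loop runs its steps."""
    log: List[str] = []
    for user_input in user_inputs:  # outer loop: accept input, issue task
        task = user_input.strip()
        for name, arg in decompose(task):  # inner loop: executable sequence
            log.append(MODULES[name](arg))
    return log


print(run_agent(["find the controller button"]))
# ['located the controller button', 'navigated to the controller button']
```

Keeping the modules in a registry means the inner loop dispatches by name, so a new subtask only needs a new entry rather than a change to either loop.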