3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting

Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu

TL;DR

This paper tackles the limitations of scene abstractions in zero-shot object navigation by introducing 3DGSNav, which uses 3D Gaussian Splatting as a persistent memory for vision-language models. Through active perception, memory-enabled free-viewpoint rendering, and structured visual prompts combined with Chain-of-Thought prompting, the approach supports spatial reasoning over frontier-aware first-person views. The method pairs a real-time detector for efficiency with a VLM-driven re-verification module that confirms target objects. Evaluations on multiple benchmarks and real-world quadruped experiments show robust, competitive performance against state-of-the-art baselines. The results suggest that grounding VLM reasoning in a 3D, memory-rich representation can improve navigation performance and reliability in unknown environments, with practical implications for embodied AI systems.

Abstract

Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/
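
The detect-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` type, the `confirm_target` function, the `verify` callback (a stand-in for a VLM query on a view), and the `min_score` and `min_votes` thresholds are all hypothetical names and values chosen for illustration. The sketch only shows the general pattern of a cheap filter followed by a more expensive check from several viewpoints.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """A detector proposal (hypothetical structure, not from the paper)."""
    label: str
    score: float   # detector confidence in [0, 1]
    view_id: int   # index of the view in which the object was detected


def confirm_target(
    target: str,
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate, int], bool],
    alternative_views: List[int],
    min_score: float = 0.3,   # assumed threshold, for illustration only
    min_votes: int = 2,       # assumed number of agreeing views
) -> Optional[Candidate]:
    """Return the first candidate that passes both stages, or None."""
    # Stage 1: cheap filter. Keep proposals with the right label and
    # enough detector confidence, most confident first.
    shortlisted = sorted(
        (c for c in candidates if c.label == target and c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    # Stage 2: expensive check. Ask the verifier (e.g. a VLM) about the
    # candidate from several alternative viewpoints and count agreements.
    for cand in shortlisted:
        votes = 0
        for view in alternative_views:
            if verify(cand, view):
                votes += 1
                if votes >= min_votes:
                    return cand
    return None
```

In this sketch the verifier is called only for shortlisted candidates, so the expensive model runs far less often than the detector. How 3DGSNav actually selects viewpoints and aggregates VLM responses is described in the paper itself, not here.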

Paper Structure (50 sections, 21 equations, 24 figures, 7 tables)

Figures (24)

  • Figure 1: 3DGSNav Real-World Demonstration. The robot successfully navigates to locate the toilet. The reasoning process is simplified for clarity. The manipulator can move freely to support active perception.
  • Figure 2: System overview. The system builds navigation-oriented environment representations from robot poses and RGB-D observations via active perception. Free-viewpoint optimization and structured visual prompts guide VLM-based zero-shot navigation planning, while online object detection and viewpoint re-verification enable efficient target localization.
  • Figure 3: Visualization of the action-decision VLM target re-verification. The green circular bounding boxes indicate detections from the real-time detector, while the red bounding boxes denote the final detection results.
  • Figure 4: Comparison of self-explanations of Gemini3-Pro and Qwen3-235b-Thinking on ZSON tasks. The left and right images correspond to Gemini3-Pro and Qwen3-235b-Thinking, respectively. The target object is highlighted by a red bounding box.
  • Figure 5: Failure cause statistics. Distribution of different failure types.
  • ...and 19 more figures