SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation

Jiwen Zhang, Xiangyu Shi, Siyuan Wang, Zerui Li, Zhongyu Wei, Qi Wu

Abstract

Vision-and-Language Navigation (VLN) has recently benefited from Multimodal Large Language Models (MLLMs), enabling zero-shot navigation. While recent exploration-based zero-shot methods have shown promising results by leveraging global scene priors, they rely on high-quality human-crafted scene reconstructions, which are impractical for real-world robot deployment. When encountering an unseen environment, a robot should build its own priors through pre-exploration. However, these self-built reconstructions are inevitably incomplete and noisy, which severely degrade methods that depend on high-quality scene reconstructions. To address these issues, we propose SpatialAnt, a zero-shot navigation framework designed to bridge the gap between imperfect self-reconstructions and robust execution. SpatialAnt introduces a physical grounding strategy to recover the absolute metric scale for monocular-based reconstructions. Furthermore, rather than treating the noisy self-reconstructed scenes as absolute spatial references, we propose a novel visual anticipation mechanism. This mechanism leverages the noisy point clouds to render future observations, enabling the agent to perform counterfactual reasoning and prune paths that contradict human instructions. Extensive experiments in both simulated and real-world environments demonstrate that SpatialAnt significantly outperforms existing zero-shot methods. We achieve a 66% Success Rate (SR) on R2R-CE and 50.8% SR on RxR-CE benchmarks. Physical deployment on a Hello Robot further confirms the efficiency and efficacy of our framework, achieving a 52% SR in challenging real-world settings.
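
The visual anticipation mechanism described above renders what the agent would observe at a candidate future pose from the reconstructed point cloud. The following minimal Python sketch is only an illustration of that idea, not the paper's implementation: it projects the noisy point cloud through a pinhole camera with a z-buffer, using the 90° HFOV later reported for visual anticipation in Figure 3; all function and variable names are illustrative assumptions.

```python
# Minimal sketch of the visual-anticipation idea: render what the camera *would*
# see at a hypothesized future pose by projecting the (noisy) reconstructed point
# cloud through a pinhole model with a z-buffer. Names are illustrative only.
import numpy as np

def render_anticipated_view(points_w, colors, T_cam_from_world,
                            width=640, height=480, hfov_deg=90.0):
    """points_w: (N, 3) world-frame point cloud; colors: (N, 3) uint8 RGB.
    T_cam_from_world: (4, 4) rigid transform for the hypothesized future pose."""
    # Intrinsics from the horizontal field of view (90 degrees, as in Fig. 3).
    fx = width / (2.0 * np.tan(np.deg2rad(hfov_deg) / 2.0))
    fy = fx
    cx, cy = width / 2.0, height / 2.0

    # Transform points into the hypothesized camera frame.
    pts_h = np.hstack([points_w, np.ones((points_w.shape[0], 1))])
    pts_c = (T_cam_from_world @ pts_h.T).T[:, :3]
    in_front = pts_c[:, 2] > 0.1                      # keep points ahead of the camera
    pts_c, cols = pts_c[in_front], colors[in_front]

    # Pinhole projection.
    u = np.round(fx * pts_c[:, 0] / pts_c[:, 2] + cx).astype(int)
    v = np.round(fy * pts_c[:, 1] / pts_c[:, 2] + cy).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, cols = u[valid], v[valid], pts_c[valid, 2], cols[valid]

    # Z-buffer: the nearest point wins each pixel.
    image = np.zeros((height, width, 3), dtype=np.uint8)
    depth = np.full((height, width), np.inf)
    order = np.argsort(-z)                            # paint far-to-near
    for ui, vi, zi, ci in zip(u[order], v[order], z[order], cols[order]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image  # anticipated view of a candidate sub-path, to be judged by the MLLM
```

In this reading, an anticipated view is rendered for each candidate sub-path and passed to the MLLM, which can then prune paths whose anticipated observations contradict the instruction.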


Paper Structure

This paper contains 31 sections, 3 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: The reality gap for pre-exploration based zero-shot VLN agents. Previous works adopt the idealized scene reconstructions provided by existing datasets [chang2017matterport3d], treating these global priors as accurate maps, and are therefore sensitive to noise. Instead, we use the noisy scenes reconstructed by a real robot and propose a visual anticipation approach that transforms the noisy scene into supportable sub-paths.
  • Figure 2: Overview of our proposed SpatialAnt framework. We first let the agent pre-explore the environment, gathering the coordinates of navigable places. We then use this information to plan visibility-aware tours that collect geometrically coherent RGB videos and IMU traces for subsequent regional point cloud reconstruction. The reconstructed point clouds are aligned with the physical scale and merged into a global point cloud (a minimal sketch of this scale-alignment step follows the figure list). Finally, this global point cloud provides extra global context through a novel visual anticipation mechanism, allowing the MLLM to conduct sub-path based high-level reasoning.
  • Figure 3: Comparison across different MLLM backbones. We report results on a sampled subset of R2R-CE. The default HFOV of visual anticipation (VA) is set to 90$^\circ$.
  • Figure 4: Representative cases in real environments. We show frames of the agent's egocentric front view, panoramic observations (from right to left), and candidate actions on the spatial map together with our anticipated views.
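
Figure 2 mentions aligning monocular regional reconstructions with physical scale before merging them into a global point cloud. The sketch below is a hedged illustration of one such grounding strategy, not the paper's exact procedure: it estimates a single metric scale factor by comparing up-to-scale camera displacements from the reconstruction against metric displacements from IMU/odometry over the same tour segment. Function names and the median-ratio estimator are illustrative assumptions.

```python
# Hedged sketch of a physical-grounding step: recover the metric scale of a
# monocular (up-to-scale) regional reconstruction from IMU/odometry displacements.
import numpy as np

def recover_metric_scale(recon_cam_positions, odom_positions):
    """recon_cam_positions: (T, 3) camera centers in the up-to-scale reconstruction.
    odom_positions: (T, 3) corresponding metric positions from odometry/IMU."""
    recon_steps = np.linalg.norm(np.diff(recon_cam_positions, axis=0), axis=1)
    metric_steps = np.linalg.norm(np.diff(odom_positions, axis=0), axis=1)
    keep = recon_steps > 1e-6                         # ignore (near-)stationary frames
    ratios = metric_steps[keep] / recon_steps[keep]
    return float(np.median(ratios))                   # median ratio is robust to outliers

def align_region(points, recon_cam_positions, odom_positions):
    """Scale a regional point cloud to metric units before merging it into the
    global cloud used for visual anticipation."""
    s = recover_metric_scale(recon_cam_positions, odom_positions)
    return points * s
```

Under this reading, each regional reconstruction is rescaled independently and then merged, so the global point cloud used for visual anticipation carries an absolute metric scale.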