Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation

Luo Ling; Bai Qianqian

TL;DR

This work addresses a sim-to-real gap in embodied Vision-and-Language Navigation (VLN): agents that perform well in simulation suffer spatial hallucinations when transferred to real-world settings. It introduces BrainNav, a bio-inspired hierarchical system with five modules (hippocampal memory hub, visual cortex perception engine, parietal spatial constructor, prefrontal decision center, and cerebellar motion execution unit) and a dual-map, dual-orientation framework that maintains spatial cognition in continuous environments. In zero-shot real-world tests on a Limo Pro robot, BrainNav outperforms state-of-the-art VLN-CE baselines without fine-tuning. By combining biological spatial cognition principles with online perception and planning, it improves adaptability and reduces spatial hallucinations in dynamic indoor settings, offering a path toward more reliable real-world embodied navigation.

Abstract

Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucinations that cause robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a spatial cognitive navigation framework inspired by biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules (Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit) mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. In zero-shot validation in a real-world lab environment on a Limo Pro robot, BrainNav, which is compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.
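
To make the dual-map and dual-orientation idea concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a coordinate map stored as node positions, a topological map stored as an adjacency set, and headings measured in degrees counterclockwise from the +x axis. The names `DualMap`, `to_absolute`, and `to_relative` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass, field


def to_absolute(robot_heading_deg: float, relative_deg: float) -> float:
    """Convert a robot-relative bearing into an absolute heading in [0, 360)."""
    return (robot_heading_deg + relative_deg) % 360.0


def to_relative(robot_heading_deg: float, absolute_deg: float) -> float:
    """Convert an absolute heading into a signed turn in (-180, 180]."""
    diff = (absolute_deg - robot_heading_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff


@dataclass
class DualMap:
    """Hypothetical container pairing a coordinate map with a topological map."""
    coords: dict = field(default_factory=dict)  # coordinate map: node -> (x, y)
    edges: dict = field(default_factory=dict)   # topological map: node -> neighbors

    def add_node(self, name: str, x: float, y: float) -> None:
        self.coords[name] = (x, y)
        self.edges.setdefault(name, set())

    def connect(self, a: str, b: str) -> None:
        # Undirected edge: the two places are mutually reachable.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def absolute_bearing(self, src: str, dst: str) -> float:
        """Absolute heading from src to dst, derived from the coordinate map."""
        (x0, y0), (x1, y1) = self.coords[src], self.coords[dst]
        return math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360.0


# Usage: a robot at "start" facing 45 degrees wants to reach "door".
m = DualMap()
m.add_node("start", 0.0, 0.0)
m.add_node("door", 0.0, 2.0)
m.connect("start", "door")
turn = to_relative(45.0, m.absolute_bearing("start", "door"))
```

In this sketch the topological map answers which places are connected, the coordinate map supplies the absolute bearing between them, and the relative orientation is the signed turn the robot would make from its current heading.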
