AgentVLN: Towards Agentic Vision-and-Language Navigation

Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang

Abstract

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, which endows the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
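
The cross-space representation mapping described in the abstract projects 3D waypoints into the image plane. As a rough illustration only, the sketch below uses a standard pinhole camera model; the abstract does not specify the paper's actual formulation. It assumes that waypoints are already expressed in the camera frame (x right, y down, z forward) and that the intrinsics fx, fy, cx, cy are known. The function name `project_waypoints` and all parameter names are hypothetical and are not taken from the AgentVLN code.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in the camera frame, metres
Pixel = Tuple[int, int]               # (u, v) in image coordinates


def project_waypoints(
    waypoints_cam: List[Point3D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Optional[Pixel]]:
    """Project camera-frame 3D points to pixels with a pinhole model.

    Hypothetical helper: returns None for points that lie behind the
    camera or fall outside the image, so they cannot become prompts.
    """
    pixels: List[Optional[Pixel]] = []
    for x, y, z in waypoints_cam:
        if z <= 0.0:
            # Behind (or on) the image plane: not visible.
            pixels.append(None)
            continue
        # Perspective division followed by the intrinsic offset.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
        else:
            pixels.append(None)
    return pixels


# Illustrative values only (not from the paper).
example = project_waypoints(
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0, width=640, height=480,
)
```

In this sketch, a waypoint that survives the visibility checks yields a pixel location that could be drawn on the egocentric frame as a visual prompt. Waypoints mapped to `None` would need some other handling, which the abstract suggests is the role of the active exploration strategy.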

Paper Structure (19 sections, 7 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Performance comparison on the Val-Unseen split of the RxR-CE dataset. AgentVLN outperforms existing state-of-the-art models while using substantially fewer parameters. Furthermore, owing to its lightweight architecture, AgentVLN supports real-time local inference on Jetson embedded edge boards, eliminating the reliance on remote cloud deployment.
  • Figure 2: Overview of the AgentVLN framework. AgentVLN employs a VLM-as-Brain paradigm, decomposing long-horizon navigation into modular skill executions. Additionally, a context-driven fine-grained strategy and QD-PCoT mitigate localization errors and scale ambiguities, ensuring precise 3D target grounding.
  • Figure 3: Visualization of AgentVLN's navigation. Green points represent the visual prompts generated by the perception-level skills, whereas red circles denote the navigation waypoints selected or predicted by the model. Notably, when traversing narrow passages or confronting severe visual occlusions, the model seamlessly outputs fine-grained atomic actions for trajectory fine-tuning. This demonstrates its robust capability to achieve highly accurate, collision-free navigation relying exclusively on egocentric observations.
  • Figure 4: Navigation results in real-world indoor and outdoor environments. Whether the agent is navigating complex, confined indoor spaces or outdoor scenes with challenging illumination, the proposed model consistently and accurately comprehends natural-language instructions, enabling it to rapidly plan and execute precise navigation trajectories.
  • Figure 5: Ablation on the impact of different temporal context lengths.