DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

Zihao Xin, Wentong Li, Yixuan Jiang, Bin Wang, Runming Cong, Jie Qin, Shengjun Huang

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming compounding errors. To address these issues, we propose DecoVLN, a framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three criteria: semantic relevance to the instruction, visual diversity with respect to the already selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a corrective finetuning strategy that operates at the level of state-action pairs. By using the geodesic distance between states to quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs within a trusted region while filtering out polluted, low-relevance data. This improves both the efficiency and the stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
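To make the memory-selection idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a greedy loop in which each step adds the candidate frame with the highest weighted sum of three terms: relevance, diversity, and coverage. The function name `select_memory`, the weights `w_rel`, `w_div`, `w_cov`, the use of cosine similarity, and the exact form of each term are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_memory(
    frame_feats: Sequence[Sequence[float]],  # one feature vector per candidate frame
    frame_times: Sequence[float],            # timestamp of each candidate frame
    instr_feat: Sequence[float],             # instruction embedding (same dimension)
    k: int,                                  # memory bank size
    w_rel: float = 1.0,                      # hypothetical weight: semantic relevance
    w_div: float = 1.0,                      # hypothetical weight: visual diversity
    w_cov: float = 1.0,                      # hypothetical weight: temporal coverage
) -> List[int]:
    """Greedily pick up to k frame indices, returned in temporal order."""
    n = len(frame_feats)
    span = (max(frame_times) - min(frame_times)) if n > 1 else 0.0
    span = span or 1.0  # avoid division by zero for degenerate trajectories
    selected: List[int] = []
    remaining = set(range(n))
    while remaining and len(selected) < k:
        best_idx, best_score = -1, -math.inf
        for i in sorted(remaining):
            # Relevance: similarity between the frame and the instruction.
            rel = cosine(frame_feats[i], instr_feat)
            if selected:
                # Diversity: dissimilarity to the closest already-selected frame.
                div = 1.0 - max(cosine(frame_feats[i], frame_feats[j]) for j in selected)
                # Coverage: normalized time gap to the nearest selected frame.
                cov = min(abs(frame_times[i] - frame_times[j]) for j in selected) / span
            else:
                div, cov = 1.0, 1.0
            score = w_rel * rel + w_div * div + w_cov * cov
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return sorted(selected, key=lambda i: frame_times[i])


# Toy example: frames 0 and 1 are visually identical; frame 2 differs and is much later.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
times = [0.0, 1.0, 10.0]
picked = select_memory(feats, times, instr_feat=[1.0, 0.0], k=2)
```

In this toy case the redundant second frame loses to the later, visually distinct one even though it matches the instruction better. That is the trade-off a unified score is meant to capture. A greedy pass is only one way to optimize such a score; the abstract does not specify the optimizer.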

Paper Structure

This paper contains 31 sections, 5 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: The framework of DecoVLN. DecoVLN decouples the agent's observation and reasoning processes. The agent perceives the environment continuously while in motion and, using the Adaptive Memory Refinement (AMR) mechanism, filters and stores high-information-density state representations in a memory bank. During the generation phase, the LLM outputs an action chunk comprising multiple consecutive actions, based on the input instruction, the current frame, and the memory bank. Subsequently, we construct an error-correction strategy based on state-action pairs. The model explores autonomously according to the instruction and collects State-Action Pairs within a trusted region for error-correction fine-tuning. This process not only improves data utilization efficiency but also equips the model with introspective and self-correction capabilities.
  • Figure 2: Success rates for various $K$ values and hyper-parameter settings on R2R Val-Unseen.
  • Figure 3: Comparison of Uniform Sampling strategy and Adaptive Memory Refinement mechanism. The history obtained via the uniform sampling strategy captures a large number of instruction-irrelevant images, such as walls and corners. This irrelevant semantic information severely impacts the model's reasoning performance. In contrast, the Adaptive Memory Refinement mechanism can effectively extract the key navigation points (indicated in red) from the instruction to achieve efficient and accurate navigation.
  • Figure 4: The robot accurately follows complex natural language instructions involving spatial reasoning and object grounding, demonstrating robust performance and strong sim-to-real generalization under challenging conditions.
  • Figure 5: Performance comparison on the long-horizon navigation validation set.
  • ...and 3 more figures
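
The trusted-region collection of State-Action Pairs mentioned in the Figure 1 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code. It models the environment as a weighted, undirected graph, takes a state's deviation to be its shortest-path (geodesic) distance to the nearest expert-path node, and keeps only pairs whose deviation is within a `radius`. The names `geodesic_to_path`, `filter_trusted_pairs`, and `radius` are hypothetical. How the stored action label is obtained is not specified here, so the sketch simply keeps whatever action accompanies each state.

```python
import heapq
import itertools
import math
from typing import Dict, List, Sequence, Tuple

# Assumed representation: adjacency map with symmetric edges and positive weights.
Graph = Dict[str, Dict[str, float]]


def geodesic_to_path(graph: Graph, expert_path: Sequence[str]) -> Dict[str, float]:
    """Multi-source Dijkstra: geodesic distance from each reachable node to the expert path."""
    dist: Dict[str, float] = {node: 0.0 for node in expert_path}
    tie = itertools.count()  # tie-breaker so the heap never compares node labels
    heap = [(0.0, next(tie), node) for node in dist]
    heapq.heapify(heap)
    while heap:
        d, _, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v))
    return dist


def filter_trusted_pairs(
    graph: Graph,
    expert_path: Sequence[str],
    rollout: Sequence[Tuple[str, str]],  # (state, action) pairs from the agent's exploration
    radius: float,                       # hypothetical trusted-region radius
) -> List[Tuple[str, str]]:
    """Keep pairs whose state lies within `radius` (geodesic) of the expert path."""
    dist = geodesic_to_path(graph, expert_path)
    return [(s, a) for s, a in rollout if dist.get(s, math.inf) <= radius]


# Toy graph: expert path A-B-C; D branches off B; E lies further away behind D.
graph: Graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0, "D": 1.0},
    "C": {"B": 1.0},
    "D": {"B": 1.0, "E": 2.0},
    "E": {"D": 2.0},
}
rollout = [("A", "forward"), ("D", "left"), ("E", "forward")]
trusted = filter_trusted_pairs(graph, ["A", "B", "C"], rollout, radius=1.5)
```

Here the pair collected at `E` is dropped because it lies three units from the expert path, while the mild deviation at `D` is kept as useful correction data. In a simulator the geodesic distance would typically come from the navigation mesh rather than a hand-built graph.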