Trajectory-Diversity-Driven Robust Vision-and-Language Navigation

Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong

Abstract

Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on the R2R and REVERIE benchmarks, with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate a +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
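
The results above are reported in SPL (Success weighted by Path Length), the standard VLN metric that credits a successful episode in proportion to how close the executed path length is to the shortest path. As a minimal sketch of how such a score is typically computed (the function name `spl` and its argument names are illustrative assumptions, not taken from the paper's code):

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path the agent took.
    Failed episodes contribute 0.
    """
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue
        denom = max(taken, shortest)
        # Assumption: a zero-length successful episode counts as a perfect score.
        total += shortest / denom if denom > 0 else 1.0
    return total / n
```

For example, one success that takes twice the shortest path plus one failure gives `spl([True, False], [10.0, 10.0], [20.0, 10.0]) == 0.25`.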

Paper Structure (25 sections, 19 equations, 5 figures, 13 tables, 1 algorithm)

Figures (5)

  • Figure 1: Navigation behavior under early-stage perturbations in unfamiliar environments. The baseline IL agent struggles to recover from errors due to limited exposure to failed trajectories, often detouring or failing to reach the goal. Our NavGRPO agent learns from diverse rollouts, enabling robust error correction and successful navigation despite perturbations.
  • Figure 2: Overview of our NavGRPO training framework for vision-language navigation. For each instruction, we sample K diverse trajectories through policy rollout, compute rewards using trajectory-level and step-level signals, estimate group relative advantages by comparing within instruction groups, and optimize the policy through debiased advantage estimation without value networks.
  • Figure 3: Qualitative comparison on challenging instructions under normal conditions (top) and initial perturbations (bottom). ScaleVLN fails to recover from errors in both scenarios. Our GRPO-trained agent successfully completes tasks and demonstrates robust error correction under perturbations.
  • Figure 4: Qualitative comparison of navigation trajectories. The purple point indicates the starting location. NavGRPO's trajectory (green) successfully completes the instruction by navigating through the archway and reaching the target doorway, while ScaleVLN's trajectory (yellow) fails to execute the complete route, demonstrating NavGRPO's superior robustness against environmental complexity.
  • Figure 5: Failure case analysis. NavGRPO incorrectly interprets the spatial reference "second bedroom to your left" in an instruction with multiple ambiguous directional cues, highlighting challenges in compositional spatial reasoning.
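
The group relative advantage step described in Figure 2 can be illustrated with a small sketch. It assumes the K rollouts for one instruction have already been reduced to scalar rewards, and it uses the commonly described GRPO normalization (subtract the group mean, divide by the group standard deviation). The caption does not specify the paper's "debiased" estimator, so the `normalize_std=False` option below, which keeps only the mean subtraction, is a hypothetical stand-in rather than the authors' method. All names are illustrative.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    normalize_std: bool = True,
    eps: float = 1e-8,
) -> List[float]:
    """Score each rollout relative to the other rollouts for the same instruction.

    No value network is involved: the group mean acts as the baseline.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two rollouts per instruction group")

    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        # Assumed variant: mean-centering only, no division by the group std.
        return centered

    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [c / (std + eps) for c in centered]


# Four rollouts for one instruction: two reach the goal, two do not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat the group average receive positive advantages and the rest receive negative ones, so the policy update favors the better trajectories within each instruction group. When all rollouts in a group earn the same reward, every advantage is zero and that group contributes no gradient signal.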