Learning Goal-Oriented Vision-and-Language Navigation with Self-Improving Demonstrations at Scale

Songze Li, Zun Wang, Gengze Zhou, Jialu Li, Xiangyu Zeng, Ziyang Gong, Limin Wang, Yu Qiao, Qi Wu, Mohit Bansal, Yi Wang

Abstract

Goal-oriented vision-and-language navigation requires robust exploration capabilities for agents to navigate to specified goals in unknown environments without step-by-step instructions. Existing methods tend to rely exclusively on shortest-path trajectories and therefore lack effective exploration priors for training navigation agents. To address these challenges, we present SID, a goal-oriented vision-and-language navigation learning approach with Self-Improving Demonstrations. Specifically, SID learns an initial agent on the shortest-path data sampled from environments and then leverages this agent to generate novel exploration trajectories. The novel rollouts provide demonstrations with stronger exploration strategies to train a better agent, which in turn produces higher-quality agent demonstrations for the next round of training. We show that this iterative self-improving pipeline readily scales to new environments, and the resulting demonstrations are highly transferable, elevating the performance ceiling across a variety of vision-and-language navigation tasks. Extensive experiments demonstrate that SID significantly boosts the exploration capabilities and generalization of navigation agents. The resulting agent achieves new state-of-the-art performance on goal-oriented vision-and-language navigation benchmarks, including REVERIE and SOON, as well as strong transferability to object-goal navigation and VLN-CE. It notably achieves a 50.9% success rate on the unseen validation splits of SOON, surpassing prior leading approaches by a margin of 13.9%.
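The iterative pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `self_improving_demonstrations` and its `train`, `rollout`, and `is_successful` callables are hypothetical placeholders, and the choice to train each round only on the latest successful rollouts (rather than accumulating data across rounds) is an assumption made for brevity. The toy usage at the end replaces agents and trajectories with integers purely to make the control flow executable.

```python
from typing import Callable, List, Tuple, TypeVar

Demo = TypeVar("Demo")    # one demonstration trajectory (hypothetical type)
Agent = TypeVar("Agent")  # a trained navigation agent (hypothetical type)


def self_improving_demonstrations(
    shortest_path_demos: List[Demo],
    train: Callable[[List[Demo]], Agent],
    rollout: Callable[[Agent], List[Demo]],
    is_successful: Callable[[Demo], bool],
    rounds: int,
) -> Tuple[Agent, List[Demo]]:
    """Alternate between training an agent and collecting its successful rollouts."""
    demos = list(shortest_path_demos)
    agent = train(demos)  # round 0: initial agent learned from shortest-path data
    for _ in range(rounds):
        # The current agent explores and proposes new trajectories.
        candidates = rollout(agent)
        # Assumption: only successful rollouts are kept as new demonstrations.
        kept = [d for d in candidates if is_successful(d)]
        if not kept:
            break  # no usable demonstrations this round; stop early
        demos = kept
        agent = train(demos)  # a new agent trained on the agent-generated data
    return agent, demos


# Toy stand-ins (illustrative only): a "demo" is an integer quality score,
# an "agent" is the best score it was trained on, and each rollout yields
# one slightly worse and one slightly better demo than the agent itself.
final_agent, final_demos = self_improving_demonstrations(
    shortest_path_demos=[1],
    train=lambda demos: max(demos),
    rollout=lambda agent: [agent - 1, agent + 1],
    is_successful=lambda demo: demo > 0,
    rounds=3,
)
print(final_agent, final_demos)  # 4 [2, 4]
```

In the toy run, each round produces a demonstration slightly better than the agent that generated it, so the trained agent improves from 1 to 4 over three rounds. In a real system, `train` would fit a navigation policy and `rollout` would run it in simulated environments.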

Paper Structure

This paper contains 39 sections, 2 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: SID-VLN generates over 46M self-improving exploration trajectories across 860 environments, exhibiting strong generalization across diverse goal modalities.
  • Figure 2: Comparison of three goal-oriented VLN training paradigms. Learning exploration by imitating human demonstrations is costly and difficult to scale up, while learning general navigation from large-scale instruction augmentation on shortest paths lacks exploration demonstrations. In contrast, SID-VLN provides large-scale demonstrations of exploration strategies through an iterative self-improving approach, eliminating the reliance on costly human demonstrations.
  • Figure 3: Our proposed Self-Improving Demonstrations paradigm for goal-oriented VLN. We learn an initial navigation agent using trajectories sampled from MP3D, generate new paths using this agent, and retain the successful exploration trajectories. These trajectories serve as demonstrations of exploration strategies, resulting in a more capable agent. This iterative semi-supervised learning gradually raises the navigation agent's performance ceiling and produces effective exploration trajectories at scale, which can be transferred to goal-oriented VLN with caption augmentation.
  • Figure 4: Filtering the trajectories generated by the navigation agent. The agent may fail in various scenarios, such as terminating at similar but incorrect targets or exceeding the path length limitation. Only the trajectories that successfully reach the correct target with efficient exploration will be retained for subsequent iterations.
  • Figure 5: Prompt and Model Output of the Detail-style Captions.
  • ...and 3 more figures
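
The filtering step summarized in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's actual criterion: the `Rollout` type, the `keep_rollout` and `filter_rollouts` helpers, the exact-match test on the final viewpoint, and the `max_steps` limit are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rollout:
    path: Tuple[str, ...]  # visited viewpoint ids, in order (hypothetical format)
    target: str            # viewpoint id of the correct goal


def keep_rollout(rollout: Rollout, max_steps: int) -> bool:
    """Keep a rollout only if it ends at the correct target within the step limit."""
    if not rollout.path:
        return False
    # Assumption: success means the final viewpoint is exactly the target.
    reached_target = rollout.path[-1] == rollout.target
    # Assumption: "efficient exploration" is approximated by a cap on the step count.
    within_limit = len(rollout.path) - 1 <= max_steps
    return reached_target and within_limit


def filter_rollouts(rollouts: List[Rollout], max_steps: int) -> List[Rollout]:
    """Return the rollouts that pass keep_rollout, preserving their order."""
    return [r for r in rollouts if keep_rollout(r, max_steps)]


rollouts = [
    Rollout(path=("a", "b", "goal"), target="goal"),            # success, 2 steps
    Rollout(path=("a", "b", "lookalike"), target="goal"),       # stops at a wrong target
    Rollout(path=("a", "b", "c", "d", "goal"), target="goal"),  # 4 steps, over the limit
]
kept = filter_rollouts(rollouts, max_steps=3)
print(len(kept))  # 1
```

The second and third rollouts mirror the two failure cases named in the caption: terminating at a similar but incorrect target, and exceeding the path length limit.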