Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

Hao Tan, Licheng Yu, Mohit Bansal

TL;DR

The paper tackles instruction-following navigation in unseen environments with a two-stage training regime. Stage one combines imitation learning and reinforcement learning (IL+RL) for robust policy learning. Stage two applies semi-supervised back translation with environmental dropout to synthesize diverse 'unseen' triplets (environment, path, instruction). Environmental dropout simulates new environments by masking image features in a coordinated way, enabling effective fine-tuning on previously unlabeled data. Empirically, the approach achieves state-of-the-art performance on the Room-to-Room VLN task, ranking first on the private unseen test set across single-run, beam-search, and pre-exploration settings, and ablation studies show substantial gains. The results indicate that expanding environmental variability during training, through dropout-based environment augmentation and back translation, substantially improves generalization for vision-and-language navigation systems.

Abstract

A grand goal in AI is to build a robot that can accurately navigate based on natural language instructions, which requires the agent to perceive the scene, understand and ground language, and act in the real-world environment. One key challenge here is to learn to navigate in new environments that are unseen during training. Most of the existing approaches perform dramatically worse in unseen environments as compared to seen ones. In this paper, we present a generalizable navigational agent. Our agent is trained in two stages. The first stage is training via mixed imitation and reinforcement learning, combining the benefits from both off-policy and on-policy optimization. The second stage is fine-tuning via newly-introduced 'unseen' triplets (environment, path, instruction). To generate these unseen triplets, we propose a simple but effective 'environmental dropout' method to mimic unseen environments, which overcomes the problem of limited seen environment variability. Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions. Empirically, we show that our agent is substantially better at generalizability when fine-tuned with these triplets, outperforming the state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, and achieving the top rank on the leaderboard.
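
To make the idea of environmental dropout concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each environment is represented as a list of per-view feature vectors, that "coordinated" masking means one mask over feature channels is sampled once and shared by every view in that environment, and that kept values are rescaled by 1/(1-p) as in conventional inverted dropout. The function names, the list-of-lists representation, and the rescaling are all illustrative assumptions.

```python
import random


def feature_dropout(views, p, rng):
    """Conventional dropout: every entry of every view is dropped independently.

    Assumes 0 <= p < 1 so that the rescaling factor is finite.
    """
    scale = 1.0 / (1.0 - p)  # assumed inverted-dropout rescaling
    return [[x * scale if rng.random() >= p else 0.0 for x in view]
            for view in views]


def environmental_dropout(views, p, rng):
    """Sketch of coordinated masking: one channel mask shared by all views.

    The mask is sampled once per call (i.e. once per simulated environment),
    so the same feature channels are zeroed in every view.
    """
    num_channels = len(views[0])
    scale = 1.0 / (1.0 - p)  # assumed inverted-dropout rescaling
    keep = [rng.random() >= p for _ in range(num_channels)]
    return [[x * scale if k else 0.0 for x, k in zip(view, keep)]
            for view in views]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy environment: 4 views, 8 feature channels each.
    views = [[1.0] * 8 for _ in range(4)]
    for row in environmental_dropout(views, 0.5, rng):
        print(row)
```

Calling `environmental_dropout` repeatedly on the same features gives differently masked copies of one environment, each of which stays internally consistent across views. That consistency is what separates it from `feature_dropout` in this sketch.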

Paper Structure

This paper contains 37 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Room-to-Room Task. The agent is given an instruction, then starts its navigation from some starting viewpoint inside the given environment. At time $t$, the agent selects one view (highlighted in red) from a set of its surrounding panoramic views to step into, as an action $a_t$.
  • Figure 2: Left: IL+RL supervised learning (stage 1). Right: Semi-supervised learning with back translation and environmental dropout (stage 2).
  • Figure 3: Comparison of the two dropout methods (based on an illustration on an RGB image).
  • Figure 4: Comparison of the two dropout methods (based on image features).
  • Figure 5: Success rates of agents trained with different amounts of data. X-axis in log-scale. The blue line represents the growth of results by gradually adding new environments to the supervised training method. The red line is trained with the same amounts of data as the blue line, but the data is randomly selected from all $60$ training environments. The dashed lines are predicted.
  • ...and 1 more figure