
Scaling Vision-and-Language Navigation With Offline RL

Valay Bundele, Mahesh Bhupati, Biplab Banerjee, Aditya Grover

TL;DR

This work addresses data inefficiency and safety in Vision-and-Language Navigation (VLN) by introducing VLN-ORL, a setup in which agents learn from large volumes of suboptimal offline trajectories. The core idea is reward-token conditioning, with dense and sparse variants of a displacement-based reward $\delta D$, integrated into VLN architectures (VLN$\circlearrowright$BERT-ORL and MTVM-ORL). The authors create offline VLN benchmarks (D-R2R, D-RxR) with noise models and demonstrate that reward-conditioned policies substantially improve navigation success and robustness across R2R and RxR, often outperforming return-conditioned baselines, especially under high suboptimality and noise. The contributions include a practical reward-token framework, the first offline VLN benchmarks in 3D environments, and strong empirical evidence that conditioning on progress rewards yields safer, more effective learning from suboptimal data with limited or no online exploration.
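
As a rough illustration of what a displacement-based reward could look like, the sketch below computes per-step rewards from a list of distances to the goal. The exact definitions are assumptions made for this example, not taken from the paper: the dense variant is assumed to be the per-step reduction in distance to the goal, and the sparse variant is assumed to be its sign. The function names and the sample distances are hypothetical.

```python
from typing import List


def dense_rewards(dist_to_goal: List[float]) -> List[float]:
    """Assumed dense variant: reduction in distance to the goal at each step.

    Positive values mean the agent moved closer, negative values mean it
    moved away, and zero means the distance did not change.
    """
    return [prev - curr for prev, curr in zip(dist_to_goal, dist_to_goal[1:])]


def sparse_rewards(dist_to_goal: List[float], tol: float = 1e-6) -> List[int]:
    """Assumed sparse variant: only the sign of the per-step displacement."""
    signs = []
    for delta in dense_rewards(dist_to_goal):
        if delta > tol:
            signs.append(1)    # moved closer to the goal
        elif delta < -tol:
            signs.append(-1)   # moved away from the goal
        else:
            signs.append(0)    # no change in distance
    return signs


# Hypothetical distances to the goal (in metres) along one logged trajectory.
distances = [10.0, 7.5, 8.0, 8.0, 3.0]
print(dense_rewards(distances))   # [2.5, -0.5, 0.0, 5.0]
print(sparse_rewards(distances))  # [1, -1, 0, 1]
```

Because both variants depend only on consecutive distances to the goal, they can be computed for any logged trajectory without further interaction with the environment.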

Abstract

The study of vision-and-language navigation (VLN) has typically relied on expert trajectories, which may not always be available in real-world situations due to the significant effort required to collect them. On the other hand, existing approaches to training VLN agents that go beyond available expert data involve data augmentations or online exploration, which can be tedious and risky. In contrast, it is easy to access large repositories of suboptimal offline trajectories. Inspired by research in offline reinforcement learning (ORL), we introduce a new problem setup of VLN-ORL which studies VLN using suboptimal demonstration data. We introduce a simple and effective reward-conditioned approach that can account for dataset suboptimality for training VLN agents, as well as benchmarks to evaluate progress and promote research in this area. We empirically study various noise models for characterizing dataset suboptimality among other unique challenges in VLN-ORL and instantiate it for the VLN$\circlearrowright$BERT and MTVM architectures in the R2R and RxR environments. Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements, even in complex and intricate environments.

Paper Structure

This paper contains 17 sections, 22 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of proposed setup and algorithmic framework for VLN-ORL. The proposed setup involves training the agent on a dataset primarily comprised of suboptimal demonstrations of varying lengths (middle two rows). Unlike conventional VLN setups that rely solely on expert trajectories (top row), we condition our agent on greedy rewards during training to capture the degree of suboptimality without imposing excessive assumptions on the environment. Positive rewards guide the agent towards the goal, negative rewards steer it away, and zero rewards prompt the agent to halt its movement. During testing, we can condition the agent on optimal greedy rewards for executing new instructions successfully (last row).
  • Figure 2: Performance comparison of VLN$\circlearrowright$BERT and reward-conditioned VLN$\circlearrowright$BERT on the validation sets: (a, b) as a function of the size of the training subset drawn from the 30% Noisy R2R dataset; (c, d) as a function of the level of noise in the proposed R2R training dataset.
  • Figure 3: Visualisation of panoramic views and headings (view in heading direction) at every step for agents trained on the 30% Noisy R2R dataset. The reward-conditioned agent correctly follows the instruction to reach the goal, whereas the baseline agent takes the wrong path and ends up elsewhere. Instead of moving forward with the sink to the left and the stove to the right (indicated by the green box), the baseline agent goes in the other direction to the hall (indicated by the red box). It continues in that direction and eventually exits the house. Conversely, the reward-conditioned agent correctly navigates the route, exiting the kitchen and entering the bedroom as instructed.
  • Figure 4: Reward token conditioning in VLN$\circlearrowright$BERT-ORL. Initially, the language instruction is encoded by the self-attention module. During navigation, the sequence of state and visual tokens along with instruction features is passed multiple times through the cross-attention and self-attention modules to infer the action prediction probabilities.
  • Figure 5: Visualisation of panoramic views and headings of VLN$\circlearrowright$BERT-ORL and VLN$\circlearrowright$BERT at every step in the trajectory.
  • ...and 3 more figures
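
The conditioning idea in the Figure 1 caption can be illustrated with a toy tabular sketch. It is only a stand-in for the transformer policies named above: a lookup table replaces the network, the room names and actions are hypothetical, and reward tokens are assumed to take the values +1, -1, and 0. The point is that a policy fitted on mostly suboptimal logs imitates the majority action, while the same logs indexed by reward token let us request the progress-making action at test time.

```python
from collections import Counter, defaultdict

# Hypothetical offline log: (viewpoint, action, reward_token).
# Most demonstrations at "kitchen" take the wrong turn (token -1).
logged = [
    ("kitchen", "go_hall", -1),
    ("kitchen", "go_bedroom", 1),
    ("kitchen", "go_hall", -1),
    ("bedroom", "stop", 0),
]


def fit_unconditioned(data):
    """Behaviour cloning stand-in: most frequent action per viewpoint."""
    counts = defaultdict(Counter)
    for viewpoint, action, _token in data:
        counts[viewpoint][action] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def fit_reward_conditioned(data):
    """Most frequent action per (viewpoint, reward_token) pair."""
    counts = defaultdict(Counter)
    for viewpoint, action, token in data:
        counts[(viewpoint, token)][action] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


baseline = fit_unconditioned(logged)
conditioned = fit_reward_conditioned(logged)

# The baseline copies the majority (suboptimal) behaviour.
print(baseline["kitchen"])            # go_hall
# Conditioning on a positive token (an assumed stand-in for the
# "optimal greedy reward") retrieves the action that made progress.
print(conditioned[("kitchen", 1)])    # go_bedroom
# A zero token corresponds to halting, as in the caption.
print(conditioned[("bedroom", 0)])    # stop
```

In this toy setting the suboptimal demonstrations are not discarded. They are kept and labelled with a negative token, which is what allows the positive-token query to ignore them.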