Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee

TL;DR

The paper addresses the gap between language-guided navigation and real-world robotic control by introducing VLN-CE, a continuous 3D navigation benchmark that uses low-level actions and egocentric RGB-D perception. It transfers nav-graph Room-to-Room trajectories into continuous Matterport3D environments within Habitat and develops two models, a Seq2Seq baseline and a Cross-Modal Attention architecture, with training regimes including imitation learning, DAgger, and synthetic data augmentation. Experiments reveal a substantial performance drop in VLN-CE relative to nav-graph benchmarks: the best model achieves roughly 0.30 SPL and around 32% success in unseen environments, which suggests that nav-graph results may be inflated by that setting's implicit assumptions. The work provides a dataset, code, and insights into integrating high-level instructions with low-level embodied control, establishing VLN-CE as a platform for studying the interplay between planning, perception, and actuation in realistic robotics settings.
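
The summary above reports SPL without defining it. As a rough illustration, assuming SPL here means Success weighted by Path Length as commonly defined for embodied navigation (each successful episode contributes the ratio of shortest-path length to the larger of the agent's path length and the shortest-path length), it could be computed as in the sketch below. The function and argument names are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes every shortest-path length is strictly positive, so the
    denominator max(path, shortest) is never zero.
    """
    total = 0.0
    for success, shortest, path in zip(successes, shortest_lengths, path_lengths):
        if success:
            # A successful episode scores 1.0 only if the agent's path
            # is no longer than the shortest path; longer paths score less.
            total += shortest / max(path, shortest)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(successes)


# Two episodes: one success that took twice the shortest path, one failure.
score = spl([True, False], [10.0, 5.0], [20.0, 5.0])
```

Under this definition SPL can never exceed the success rate, which is consistent with the reported 0.30 SPL sitting just below the roughly 32% success rate.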

Abstract

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines. While some of these techniques transfer, we find significantly lower absolute performance in the continuous setting -- suggesting that performance in prior 'navigation-graph' settings may be inflated by the strong implicit assumptions.

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The VLN setting (a) operates on a fixed topology of panoramic images (shown in blue) -- assuming perfect navigation between nodes (often meters apart) and precise localization. Our VLN-CE setting (b) lifts these assumptions by instantiating the task in continuous environments with low-level actions -- providing a more realistic testbed for robot instruction following.
  • Figure 2: We transfer nav-graph trajectories over panoramas (blue dots) from the Room-to-Room (R2R) dataset to locations in reconstructed Matterport3D (MP3D) environments. Some map to 'holes' in environment meshes where reconstruction failed or on furniture (commonly tables) where an agent could not navigate. For these, we find the nearest navigable point within 0.5m.
  • Figure 3: We successfully transfer 77% of the R2R trajectories. (a) Most panorama nodes transfer directly, but 3% require horizontal adjustment -- with an average displacement of 0.19m. (b) Despite this, some trajectories are not navigable because of differences between the panoramas and reconstructed environments, e.g. holes in the reconstructed mesh (top) or objects like chairs being manipulated between panorama captures (bottom). (c) Our setting requires significantly more agent decisions per trajectory with an average action length of 55.88 compared to 5 in R2R.
  • Figure 4: We develop a simple baseline agent (a) as well as an attentional agent (b) comparable to that of Wang et al. (2019). Both receive RGB and depth frames represented by pretrained networks for image classification (Deng et al., 2009) and point-goal navigation (Wijmans et al., 2019), respectively.
  • Figure 5: Qualitative examples of our Cross Modal Attention model taken in unseen validation environments. In the first example our agent successfully follows the instruction -- note it takes 62 actions in VLN-CE, whereas the VLN traversal requires just 3 hops.
  • ...and 1 more figures
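
The node-transfer step described in the Figure 2 and Figure 3 captions (snap a panorama location to the nearest navigable point within 0.5m, otherwise treat it as untransferable) might be sketched as follows. This is a minimal illustration only: it assumes navigable space is available as an explicit list of candidate points and uses plain Euclidean distance, whereas the paper works on reconstructed meshes in a simulator. `snap_to_navigable` and its parameters are hypothetical names.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def snap_to_navigable(
    node: Point,
    navigable_points: Sequence[Point],
    max_dist: float = 0.5,  # search radius in meters, from the Figure 2 caption
) -> Optional[Point]:
    """Return the nearest navigable point within max_dist of node, or None.

    Assumption: navigable space is given as a finite list of sample points.
    """
    best: Optional[Point] = None
    best_dist = max_dist
    for candidate in navigable_points:
        dist = math.dist(node, candidate)
        if dist <= best_dist:
            best, best_dist = candidate, dist
    return best


# A node sitting on a table, with one nearby floor point and one far away.
floor = [(0.3, 0.0, 0.0), (1.0, 0.0, 0.0)]
snapped = snap_to_navigable((0.0, 0.0, 0.0), floor)
unreachable = snap_to_navigable((5.0, 0.0, 0.0), floor)
```

In this sketch a `None` result marks a node with no navigable point in range; snapping alone does not guarantee the whole trajectory is navigable, which the Figure 3 caption notes is a separate source of failure.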