Narrowing the Gap between Vision and Action in Navigation

Yue Zhang, Parisa Kordjamshidi

TL;DR

The paper introduces a low-level action decoder, jointly trained with high-level action prediction, that lets a VLN agent ground its selected visual view in low-level controls. It also enhances the waypoint predictor with semantically rich visual representations and with explicit obstacle masking based on human prior knowledge about which actions are feasible.

Abstract

Existing methods for Vision-and-Language Navigation in Continuous Environments (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This reduces navigation to a view selection task and improves performance significantly compared to training directly on low-level actions. However, VLN-CE agents remain far from real robots because of gaps between their visual perception and the actions they execute. First, VLN-CE agents that discretize the visual environment are trained primarily on high-level view selection, which causes them to ignore the spatial reasoning involved in low-level movements. Second, the waypoint predictors used in these models neglect object semantics and attributes related to passability, which can indicate whether an action is feasible. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the VLN agent to ground the selected visual view in low-level controls. Moreover, we enhance the waypoint predictor by using visual representations that contain rich semantic information and by explicitly masking obstacles based on human prior knowledge about the feasibility of actions. Empirically, our agent improves navigation performance metrics over strong baselines with both high-level and low-level actions.
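
To make the gap between a high-level view selection and low-level controls concrete, here is a minimal sketch of a rule-based expansion of a selected waypoint into discrete actions. It is not the learned low-level action decoder described in the abstract. The function name, the action names, the 15-degree turn, and the 0.25 m forward step are assumptions chosen for illustration, and the sign convention (negative means LEFT) follows the Figure 5 caption.

```python
from typing import List

# Assumed action granularity; these values are illustrative, not taken from the paper.
TURN_DEG = 15.0    # degrees rotated by one turn action
FORWARD_M = 0.25   # metres travelled by one forward action


def waypoint_to_actions(heading_deg: float, distance_m: float) -> List[str]:
    """Expand a relative waypoint into a sequence of discrete low-level actions.

    heading_deg: rotation to the waypoint relative to the current direction;
                 negative values mean a LEFT turn, positive values a RIGHT turn.
    distance_m:  straight-line distance to the waypoint in metres.
    """
    # Wrap the heading into [-180, 180) so the agent takes the shorter rotation.
    heading = (heading_deg + 180.0) % 360.0 - 180.0
    n_turns = round(abs(heading) / TURN_DEG)
    turn = "TURN_LEFT" if heading < 0 else "TURN_RIGHT"
    n_forward = round(distance_m / FORWARD_M)
    # Rotate in place first, then move forward.
    return [turn] * n_turns + ["FORWARD"] * n_forward


if __name__ == "__main__":
    print(waypoint_to_actions(-30.0, 0.5))
```

A fixed expansion like this ignores what the agent sees along the way, which is the kind of missing spatial reasoning that the abstract attributes to training on view selection alone.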

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: (a) In the VLN-CE task, the agent receives an instruction and panoramic views of the current navigation step. (b) We explore a waypoint predictor that considers object semantics and their attributes related to passability. The green circles are navigable viewpoints, and the red circles are obstacles. (c) We equip the navigator with a dual-action module containing both high-level and low-level actions. The black circles show the low-level action sequence.
  • Figure 2: Obstacle-Aware Waypoint Predictor. Given an RGB image, we mask obstacle objects based on semantic segmentation. The masked RGB image and depth image are then input to the waypoint predictor to generate navigable viewpoints. We also enhance the RGB visual encoder with pre-trained VL representations.
  • Figure 3: Main Architecture. The waypoint predictor first provides navigable viewpoints (green circles). Then, the corresponding RGB images, depth images, and textual instructions are input to our dual-action module, where the agent learns to select a high-level viewpoint and generate a low-level action sequence. The freezing sign indicates that the parameters are frozen during training. Please refer to Fig. 4 for a detailed architecture of the low-level action decoder.
  • Figure 4: Low-Level Action Decoder.
  • Figure 5: Examples of generated low-level actions. $0$ denotes the current direction, while $-$ denotes a LEFT turn. The number gives the rotation in degrees. The yellow bounding box indicates the target.
  • ...and 1 more figures
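
The obstacle-masking step in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the semantic segmentation is available as a per-pixel integer label map; the function name `mask_obstacles`, the label IDs in `OBSTACLE_IDS`, and the fill value are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical label IDs for classes treated as obstacles (e.g. furniture);
# the real IDs depend on the segmentation model's label set.
OBSTACLE_IDS = (3, 7)


def mask_obstacles(rgb, seg, obstacle_ids=OBSTACLE_IDS, fill=0):
    """Blank out pixels that belong to obstacle classes.

    rgb: (H, W, 3) uint8 image.
    seg: (H, W) integer label map from a semantic segmentation model.
    Returns the masked copy of the image and the boolean obstacle mask.
    """
    obstacle = np.isin(seg, obstacle_ids)  # (H, W) boolean mask
    masked = rgb.copy()                    # leave the input image untouched
    masked[obstacle] = fill                # applies to all three channels
    return masked, obstacle


if __name__ == "__main__":
    rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
    seg = np.array([[0, 3], [7, 1]])
    masked, obstacle = mask_obstacles(rgb, seg)
    print(int(obstacle.sum()), "obstacle pixels masked")
```

In the pipeline the caption describes, the masked RGB image would then be passed to the waypoint predictor together with the depth image.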