Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang

TL;DR

In VLN-CE, shortest-path supervision can diverge from the natural language instruction, especially once the agent is off the reference path. The paper introduces Language-Aligned Waypoint (LAW) supervision, which guides the agent toward the nearest language-aligned waypoint on the reference path, provides a dense per-step signal, and is accompanied by a new Waypoint Accuracy metric. A Cross-Modal Attention (CMA) model is trained with a two-stage regime (teacher forcing and DAgger) under language-aligned supervision and shows improved instruction following on VLN-CE and RxR-Habitat. Ablations indicate that LAW supervision consistently outperforms goal-oriented supervision and that denser LAW signals do not harm performance. The approach generalizes across datasets and offers an interpretable framework for following sub-instructions in continuous navigation tasks.

Abstract

In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
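
The metric described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `waypoint_accuracy`, the 0.5 distance threshold, and the rule that waypoints must be reached in order are all assumptions made for illustration, and positions are simplified to 2D points.

```python
import math

def waypoint_accuracy(trajectory, waypoints, threshold=0.5):
    """Fraction of reference waypoints the agent reaches, in order.

    Assumptions (not from the paper): a waypoint counts as reached when
    some agent position lies within `threshold` of it, and waypoints
    must be reached in the order they appear on the reference path.
    """
    if not waypoints:
        return 0.0
    next_idx = 0  # index of the first waypoint not yet reached
    for pos in trajectory:
        # One position may satisfy several consecutive waypoints.
        while (next_idx < len(waypoints)
               and math.dist(pos, waypoints[next_idx]) <= threshold):
            next_idx += 1
    return next_idx / len(waypoints)

# Hypothetical episode: the agent reaches the first two of four waypoints.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
agent_path = [(0.0, 0.0), (1.0, 0.1), (1.5, 1.5)]
print(waypoint_accuracy(agent_path, reference))
```

If each waypoint is associated with a sub-instruction, a score of this kind can be read as the share of the instruction the agent has followed.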

Paper Structure

This paper contains 17 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: A language-aligned path (blue) in an instruction following task may differ from the shortest path (red) to the goal. Language-aligned supervision (blue arrows) encourages the agent at any given location (dark circles) to move towards the nearest waypoint on the language-aligned path and can hence be a better supervisory signal for instruction following than goal-oriented supervision (red arrows).
  • Figure 2: Top: The path from the start (orange) to the goal (green), with grey circles indicating LAWpano and the dashed segments indicating LAWstep. Bottom: We adapt the Cross-Modal Attention (CMA) model (Krantz et al., 2020), which predicts an action. We optimize the model using language-aligned supervision, which brings it back on the path toward the next waypoint.
  • Figure 3: Agent performance binned by nDTW value of reference path to shortest path ($95\%$ CI error bars) shows that LAWpano performs better than goal, especially on lower-range nDTW episodes. This indicates that language-aligned supervision is better suited for the instruction following task.
  • Figure 4: An example episode from the R2R unseen split. The agent learns to follow the instruction better when supervised with the language-aligned path (right) than with the goal-oriented path (left). This is reflected in higher nDTW and waypoint accuracy (WA) metrics. Note that WA can be intuitively visualized and interpreted. We also show the mapping of sub-instructions to waypoints utilizing FG-R2R for this episode.
  • Figure 5: Plots showing the distribution of the number of R2R episodes across different nDTW values of reference path to shortest path for the train, val-seen and val-unseen splits. There are many episodes for which the goal-oriented shortest path does not match the language-aligned path, as generated by the goal-oriented action sensor (top). We mitigate this problem by using a language-aligned action sensor (bottom).
  • ...and 3 more figures
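
The difference between the two supervision targets described in Figures 1 and 2 can be sketched as follows. This is an illustrative approximation, not the paper's code: the helper names, the `reach_radius` value, and the rule for advancing to the next waypoint are assumptions, and positions are simplified to 2D points. In a full system, the supervised action would be the first step of a shortest path toward the selected target.

```python
import math

def nearest_waypoint_index(position, waypoints):
    """Index of the reference-path waypoint closest to the agent."""
    return min(range(len(waypoints)),
               key=lambda i: math.dist(position, waypoints[i]))

def language_aligned_target(position, waypoints, reach_radius=0.5):
    """Target under language-aligned supervision (illustrative rule).

    Assumption: head for the nearest waypoint; if the agent is already
    within `reach_radius` of it, head for the following waypoint.
    """
    i = nearest_waypoint_index(position, waypoints)
    already_there = math.dist(position, waypoints[i]) <= reach_radius
    if already_there and i + 1 < len(waypoints):
        i += 1
    return waypoints[i]

def goal_oriented_target(waypoints):
    """Target under goal-oriented supervision: always the final goal."""
    return waypoints[-1]

# Hypothetical L-shaped reference path and an agent that drifted off it.
path = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (4.0, 2.0)]
off_path = (1.8, 0.9)
print(language_aligned_target(off_path, path))  # pulls the agent back to the path
print(goal_oriented_target(path))               # ignores the path, aims at the goal
```

In this sketch an off-path agent is steered back toward the reference path, whereas the goal-oriented target is the same wherever the agent is, which matches the contrast between the blue and red arrows in Figure 1.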