Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang
TL;DR
In Vision-and-Language Navigation in Continuous Environments (VLN-CE), supervision based on the shortest path to the goal can conflict with the natural language instruction, especially once the agent has left the reference path. The paper introduces language-aligned waypoint (LAW) supervision, which guides the agent toward the nearest language-aligned waypoint on the reference path and provides a dense per-step signal, together with a new Waypoint Accuracy metric. A Cross-Modal Attention (CMA) model is trained with a two-stage regime (teacher forcing and DAgger) under language-aligned supervision and shows improved instruction following on VLN-CE and RxR-Habitat. Ablations indicate that LAW supervision consistently outperforms goal-oriented supervision and that denser LAW signals do not harm performance. The approach generalizes across datasets and offers an interpretable framework for following sub-instructions in continuous navigation tasks.
Abstract
In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment by following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios, where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is often not aligned with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
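To make the two ideas above concrete, the following is a minimal sketch, not the authors' implementation. It assumes 2D positions, straight-line distance in place of whatever distance the simulator provides, and a hypothetical `reach_radius` threshold. The function names `law_target` and `waypoint_accuracy`, the rule of advancing to the next waypoint once the nearest one is reached, and the in-order counting used for the metric are all illustrative assumptions rather than the paper's exact definitions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def nearest_waypoint_index(agent: Point, waypoints: Sequence[Point]) -> int:
    """Index of the reference-path waypoint closest to the agent (Euclidean)."""
    return min(range(len(waypoints)), key=lambda i: math.dist(agent, waypoints[i]))


def law_target(agent: Point, waypoints: Sequence[Point],
               reach_radius: float = 0.5) -> Point:
    """Pick the waypoint the agent should be supervised toward.

    Assumption: if the nearest waypoint is already within reach_radius,
    the target advances to the following waypoint (if one exists).
    """
    i = nearest_waypoint_index(agent, waypoints)
    reached = math.dist(agent, waypoints[i]) <= reach_radius
    if reached and i + 1 < len(waypoints):
        i += 1
    return waypoints[i]


def waypoint_accuracy(trajectory: Sequence[Point], waypoints: Sequence[Point],
                      reach_radius: float = 0.5) -> float:
    """Fraction of reference waypoints the agent reached, counted in order.

    Assumption: a waypoint counts as reached when any trajectory position
    comes within reach_radius of it after all earlier waypoints were reached.
    """
    if not waypoints:
        return 0.0
    next_idx = 0
    for pos in trajectory:
        while (next_idx < len(waypoints)
               and math.dist(pos, waypoints[next_idx]) <= reach_radius):
            next_idx += 1
    return next_idx / len(waypoints)


if __name__ == "__main__":
    path = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (4.0, 2.0)]
    # Off the path: supervise toward the nearest waypoint, not the final goal.
    print(law_target((1.8, 0.9), path))
    # Agent drifted after the second waypoint, so only part of the path counts.
    print(waypoint_accuracy([(0.0, 0.0), (1.0, 0.2), (2.0, 0.3), (3.0, 1.5)], path))
```

In contrast, goal-oriented supervision would always point the off-path agent at the final goal `(4.0, 2.0)`, which may skip parts of the instruction that the intermediate waypoints correspond to.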
