Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation

Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Yiming Wang

TL;DR

This work tackles robustness gaps in Vision-and-Language Navigation in Continuous Environments (VLN-CE) by introducing a new benchmark, R2RIE-CE, that injects instruction errors reflecting real-world memory and comprehension mistakes. It formalizes the Detection and Localization of Instruction Errors and proposes IEDL, a cross-modal transformer that fuses instruction text with trajectory-visual observations to detect and pinpoint erroneous words in instructions. Empirical results show that current VLN-CE methods degrade substantially under instruction perturbations, while IEDL achieves superior detection (AUC) and localization (ATD) performance, and can reveal annotation errors in existing datasets such as R2R-CE and RxR-CE. The work highlights the importance of error-aware policies and provides resources to facilitate robust, reliable VLN-CE systems and cleaner benchmark data. Overall, the paper offers a concrete framework for recognizing and localizing instruction errors, with implications for improving navigation reliability in real-world, human-in-the-loop settings.

Abstract

Vision-and-Language Navigation in Continuous Environments (VLN-CE) is one of the most intuitive yet challenging embodied AI tasks. Agents are tasked to navigate towards a target goal by executing a set of low-level actions, following a series of natural language instructions. All VLN-CE methods in the literature assume that language instructions are exact. However, in practice, instructions given by humans can contain errors when describing a spatial environment due to inaccurate memory or confusion. Current VLN-CE benchmarks do not address this scenario, making the state-of-the-art methods in VLN-CE fragile in the presence of erroneous instructions from human users. For the first time, we propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes. This benchmark provides valuable insight into the robustness of VLN systems in continuous environments. We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark. Moreover, we formally define the task of Instruction Error Detection and Localization, and establish an evaluation protocol on top of our benchmark dataset. We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization, compared to baselines. Surprisingly, our proposed method has revealed errors in the validation set of the two commonly used datasets for VLN-CE, i.e., R2R-CE and RxR-CE, demonstrating the utility of our technique in other tasks. Code and dataset available at https://intelligolabs.github.io/R2RIE-CE
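The TL;DR reports detection quality as AUC. As a minimal sketch, assuming AUC here means the area under the ROC curve for the binary question "does this instruction contain an error?", it can be computed as the fraction of (erroneous, correct) instruction pairs that a detector ranks in the right order. The function name, labels, and scores below are illustrative assumptions, not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 if the instruction contains an error, 0 otherwise (assumed encoding).
    scores: hypothetical detector outputs, higher means "more likely erroneous".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # erroneous instruction ranked above a correct one
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up toy data: two erroneous and two correct instructions.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to a detector that ranks at chance, and 1.0 to one that always scores erroneous instructions above correct ones.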

Paper Structure (7 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: An agent navigates in a scene, following instructions expressed in natural language, for example "Exit the bathroom and go left (✓ right), then turn left at the big clock and go into the bedroom and wait next to the bed." By just changing "right" to "left" in the instruction, the agent terminates the exploration in the wrong location, ignoring the fact that along the path it did not see the "big clock" (yellow arrow).
  • Figure 2: Comparison of the Success Rate (SR) of different methods working on continuous environments (in order, by citation key: an2023bevbert, an2023etpnav, discrete_to_cont, discrete_to_cont, krantz_vlnce_2020, krantz_vlnce_2020). We show the SR on the standard R2R-CE Val Unseen split (green) and the drop in SR when errors are present (red). SR drops by up to $25\%$ when up to three errors among {Direction, Room, Object} are present per episode.
  • Figure 3: Architecture of our proposed IEDL model, representing the scenario depicted in Fig. 1. The frozen policy $\pi$ follows Instruction $\Upsilon$, producing a sequence of observations $\mathcal{O}$. Then, a panoramic encoder and a language encoder produce, respectively, the trajectory visual features $\Gamma$ and instruction features $\Upsilon$. We then feed the trajectory set $\Gamma$ and $\Upsilon$ to a cross-modal multi-layer transformer to produce visual-language aligned features. Finally, two specialized heads perform Instruction Error Detection and Instruction Error Localization, respectively.
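
The direction error illustrated in Figure 1 (changing "right" to "left") can be sketched as a simple word swap. This is only an illustration of the idea: the swap table, the function name `inject_direction_error`, and the whitespace tokenization are assumptions, and the benchmark's actual perturbation procedure is not described in this section. The function also returns the index of the changed word, which is the kind of ground truth a localization task would need.

```python
import re

# Hypothetical swap table for direction words (an assumption for this sketch).
DIRECTION_SWAPS = {"left": "right", "right": "left"}


def inject_direction_error(instruction, occurrence=0):
    """Swap the `occurrence`-th direction word in the instruction.

    Returns (perturbed_instruction, word_index). If there is no such
    direction word, returns the original instruction and None.
    Note: the replacement word is always lowercase.
    """
    tokens = instruction.split()
    seen = 0
    for i, tok in enumerate(tokens):
        core = tok.strip(".,;:!?").lower()   # ignore surrounding punctuation
        if core in DIRECTION_SWAPS:
            if seen == occurrence:
                # Replace only the word itself, keeping attached punctuation.
                tokens[i] = re.sub(core, DIRECTION_SWAPS[core], tok,
                                   count=1, flags=re.IGNORECASE)
                return " ".join(tokens), i
            seen += 1
    return instruction, None


correct = ("Exit the bathroom and go right, then turn left at the big clock "
           "and go into the bedroom and wait next to the bed.")
perturbed, error_index = inject_direction_error(correct, occurrence=0)
print(perturbed)
print(error_index)  # 5
```

Here the first direction word ("right,", word index 5) becomes "left,", reproducing the erroneous instruction from Figure 1 while leaving the later "turn left at the big clock" untouched.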