Navigating Beyond Instructions: Vision-and-Language Navigation in Obstructed Environments

Haodong Hong; Sen Wang; Zi Huang; Qi Wu; Jiajun Liu

TL;DR

This work addresses the practical gap in Vision-and-Language Navigation by revealing the instruction-reality mismatch that arises from unexpected obstructions. It introduces R2R-UNO, a dataset that couples graph-level obstructions with visual obstructions, and proposes ObVLN, a curriculum-based method augmented with virtual graph nodes to enhance adaptability to obstructed environments. Through extensive experiments on R2R, REVERIE, and R2R-UNO, the authors demonstrate that standard VLN models struggle under obstructions, while ObVLN achieves state-of-the-art performance in obstructed settings and preserves robust behavior in unobstructed scenarios. The approach combines an object-insertion inpainting pipeline with a graph-construction mechanism and curriculum learning to enable detours and instruction-guidance recovery, offering a practical path toward obstruction-aware VLN systems.

Abstract

Real-world navigation often involves dealing with unexpected obstructions such as closed doors, moved objects, and unpredictable entities. However, mainstream Vision-and-Language Navigation (VLN) tasks typically assume that instructions align perfectly with fixed, predefined navigation graphs free of obstructions. This assumption overlooks potential discrepancies between the actual navigation graphs and the given instructions, which can cause major failures for both indoor and outdoor agents. To address this issue, we integrate diverse obstructions into the R2R dataset by modifying both the navigation graphs and visual observations, introducing an innovative dataset and task, R2R with UNexpected Obstructions (R2R-UNO). R2R-UNO contains various types and numbers of path obstructions to generate instruction-reality mismatches for VLN research. Experiments on R2R-UNO reveal that state-of-the-art VLN methods inevitably encounter significant challenges when facing such mismatches, indicating that they rigidly follow instructions rather than navigate adaptively. Therefore, we propose a novel method called ObVLN (Obstructed VLN), which includes a curriculum training strategy and virtual graph construction to help agents effectively adapt to obstructed environments. Empirical results show that ObVLN not only maintains robust performance in unobstructed scenarios but also achieves a substantial performance advantage with unexpected obstructions.
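
To make the idea of a graph-level obstruction concrete, the following minimal sketch blocks one edge on an instructed path in a toy navigation graph and checks that a detour to the goal still exists. The toy graph, the helper names (`shortest_path`, `block_edge`, `obstruct_path`), and the rule of blocking the first edge whose removal keeps the goal reachable are all assumptions made for illustration; this is not the R2R-UNO generation procedure, which also alters visual observations.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph[node]):  # sorted for deterministic output
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def block_edge(graph, u, v):
    """Return a copy of the graph with the undirected edge (u, v) removed."""
    blocked = {n: set(nbrs) for n, nbrs in graph.items()}
    blocked[u].discard(v)
    blocked[v].discard(u)
    return blocked


def obstruct_path(graph, path):
    """Block the first edge on `path` whose removal keeps the goal reachable.

    Hypothetical selection rule, chosen only to keep the example small.
    """
    for u, v in zip(path, path[1:]):
        candidate = block_edge(graph, u, v)
        if shortest_path(candidate, path[0], path[-1]) is not None:
            return candidate, (u, v)
    return graph, None


# Toy undirected navigation graph (adjacency sets); node names are made up.
toy_graph = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "E"},
    "D": {"A", "E"},
    "E": {"C", "D"},
}
instruction_path = ["A", "B", "C", "E"]  # the route the instruction describes

obstructed, blocked = obstruct_path(toy_graph, instruction_path)
detour = shortest_path(obstructed, "A", "E")
print("blocked edge:", blocked)
print("detour:", detour)
```

In this toy case the instruction still describes the route through B and C, while the only traversable route runs through D. That gap between the described and the available path is the kind of instruction-reality mismatch the abstract refers to.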

Paper Structure (20 sections, 8 equations, 6 figures, 5 tables)

This paper contains 20 sections, 8 equations, 6 figures, 5 tables.

Introduction
Related work
Obstructed Environments
Problem Setup
R2R-UNO
Graph Changes
Visual Changes
Instruction-Reality Mismatches and Solution
Current VLN Methods Struggle in R2R-UNO
ObVLN
Experiments
Datasets and Evaluation Metrics
Implementation Details
Main Results
Ablation Study
...and 5 more sections

Figures (6)

Figure 1: Discrepancy between instructions and reality in real-world navigation. Human instructions are based on prior memory and often cannot align with real-time environments. Current VLN environments overlook this mismatch, potentially causing navigation failure.
Figure 2: The overall framework of our method. We first generate obstructed environments based on existing datasets, and then train agents with our proposed curriculum strategy and graph construction mechanism on both data types.
Figure 3: The object insertion (left) and filtering module (right) in generating R2R-UNO. The red dot $\bullet$ marks the position of node B in the view of node A; the $\circ$ operator represents pixel-wise multiplication, while the $*$ symbol indicates pixel-wise matrix multiplication applied to image coordinates. The notation $j_{1:8}$ covers eight adjacent views ($j_1$ to $j_8$). The final images are highlighted in red. The dotted line from the score buffer illustrates the training process with all compatibility scores.
Figure 4: The large performance drop of current VLN methods in the validation unseen splits of R2R-UNO.
Figure 5: Qualitative analysis of inpainting results. Left: Original Matterport3D views. Middle: Results without the filtering module. Right: R2R-UNO results. The red dashed line denotes the mask contour.
...and 1 more figure