General Scene Adaptation for Vision-and-Language Navigation

Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu

TL;DR

This work introduces General Scene Adaptation for Vision-and-Language Navigation (GSA-VLN), a setting where agents continually adapt to a specific scene by maintaining a memory of past observations and instructions and optionally updating parameters during navigation. To evaluate this setting, the authors construct GSA-R2R, a large, diverse dataset that includes both in-distribution (ID) and out-of-distribution (OOD) environments and introduces multiple speaking styles for instructions via a three-stage orchestration pipeline. They propose GR-DUET, a memory-based graph framework that combines a global topological graph maintained across episodes with environment-specific training to preserve historical context and improve planning. Empirical results show that GR-DUET outperforms existing VLN and adaptation methods across GSA-R2R splits in both environment-adaptation and instruction-adaptation settings, highlighting the practical potential of scene-aware memory graphs for real-world robot navigation.

Abstract

Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments. To better reflect these real-world conditions, we introduce GSA-VLN, a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of OOD data, and the limited number and style diversity of instructions for each scene. Therefore, we propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the R2R dataset to evaluate agent adaptability in both ID and OOD contexts. Furthermore, we design a three-stage instruction orchestration pipeline that leverages LLMs to refine speaker-generated instructions and apply role-playing techniques to rephrase instructions into different speaking styles. This is motivated by the observation that each individual user often has consistent signatures or preferences in their instructions. We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods. Based on our findings, we propose a novel method, GR-DUET, which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits.
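
The idea of a memory-based navigation graph that persists across episodes can be illustrated with a minimal Python sketch. This is not the GR-DUET implementation: the class `SceneMemoryGraph`, its methods, and the viewpoint names are all hypothetical, and breadth-first search stands in for the learned planner. The sketch only shows how observations from earlier episodes in the same scene can make routes available to later ones.

```python
from collections import deque


class SceneMemoryGraph:
    """Hypothetical scene-level memory: a topological graph kept across episodes."""

    def __init__(self):
        self.neighbors = {}      # viewpoint id -> set of adjacent viewpoint ids
        self.visited = set()     # viewpoints the agent has actually stood on
        self.instructions = []   # instructions received so far in this scene

    def add_episode(self, instruction, trajectory, observed_edges):
        """Merge one finished episode into the persistent graph."""
        self.instructions.append(instruction)
        self.visited.update(trajectory)
        for a, b in observed_edges:
            self.neighbors.setdefault(a, set()).add(b)
            self.neighbors.setdefault(b, set()).add(a)

    def shortest_path(self, start, goal):
        """Breadth-first search over everything remembered so far."""
        if start not in self.neighbors or goal not in self.neighbors:
            return None
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.neighbors[node]):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Assumed toy scene: two episodes that share the "hall" viewpoint.
memory = SceneMemoryGraph()
memory.add_episode("Walk to the kitchen.", ["hall", "kitchen"], [("hall", "kitchen")])
before = memory.shortest_path("kitchen", "office")   # office not yet seen
memory.add_episode("Go to the office.", ["hall", "office"], [("hall", "office")])
after = memory.shortest_path("kitchen", "office")    # route through the hall
```

In this toy scene, the second episode adds the office to the graph, so a kitchen-to-office route exists only after both episodes have been merged. An agent whose memory is reset per episode would never see that connection.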

Paper Structure

This paper contains 63 sections, 4 equations, 16 figures, 21 tables.

Figures (16)

  • Figure 1: Comparison between the traditional VLN task and the proposed GSA-VLN task. Traditional VLN agents can only execute fixed-style instructions with frozen parameters, so they remain unfamiliar with an environment such as a laboratory even after extended use. In contrast, GSA-VLN enables agents to dynamically update parameters, leverage long-term history from the memory bank, and quickly adapt to both the environment and varying instruction styles from different users.
  • Figure 2: Left: Building type counts in R2R and GSA-R2R. Right: Comparison of buildings in R2R and GSA-R2R. Unlike R2R, where evaluation scenes are similar to the training set, GSA-R2R includes a more diverse mix of both in-distribution (ID) and out-of-distribution (OOD) data.
  • Figure 3: The generation procedure of the GSA-R2R dataset. Top: environment selection. Bottom: the three-stage instruction orchestration pipeline.
  • Figure 4: The t-SNE analysis of instructions from R2R (left) and GSA-R2R (right). Our instructions demonstrate significantly greater diversity compared to R2R and include OOD data.
  • Figure 5: Comparison of instructions between R2R and various speaking styles in GSA-R2R. Words that represent the speaking style are underlined. Our instructions demonstrate significantly greater diversity and distinctiveness in speaking styles.
  • ...and 11 more figures