
Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation

Muraleekrishna Gopinathan, Martin Masek, Jumana Abu-Khalaf, David Suter

TL;DR

The paper tackles the problem of generating high-quality, diverse navigational instructions for Vision-and-Language Navigation (VLN) by introducing the Spatially-Aware Speaker (SAS). SAS is an encoder-decoder that fuses Action Encoding, Structural Encoding, and Semantic Encoding with Panoramic Room-Object Attention to produce instructions conditioned on a trajectory, and it is trained with a combination of adversarial reward learning (ARL) and supervised learning. Path Mixing, a data-augmentation strategy, and the ARL framework are proposed to mitigate exposure bias and metric gaming, with objective terms including $L_{LM}$, $L_{ULS}$, and $L_{TAL}$ guiding learning. On the VLN datasets R2R and R4R, SAS achieves substantial improvements in SPICE and other metrics, with generated instructions that refer more richly to landmarks and actions, which highlights the practical value of incorporating spatial and semantic cues into instruction generation. The work suggests future directions toward integrating multimodal transformers to further enhance open-vocabulary instruction generation and robustness to dataset limitations.
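As a rough illustration of the Path Mixing idea mentioned above, the sketch below splices two trajectories at the first viewpoint they share. This is only an assumption about how such mixing could work: the function name `mix_paths`, the representation of a path as a list of viewpoint IDs, and the splice rule are all hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List, Optional


def mix_paths(path_a: List[str], path_b: List[str]) -> Optional[List[str]]:
    """Splice two paths at the first viewpoint of path_a that also occurs in path_b.

    Hypothetical helper: returns the prefix of path_a up to and including the
    shared viewpoint, followed by the part of path_b after that viewpoint.
    Returns None when the two paths share no viewpoint.
    """
    # Record the first position of every viewpoint in path_b.
    first_index_in_b: Dict[str, int] = {}
    for j, viewpoint in enumerate(path_b):
        first_index_in_b.setdefault(viewpoint, j)

    # Walk path_a and splice at the first viewpoint that path_b also visits.
    for i, viewpoint in enumerate(path_a):
        j = first_index_in_b.get(viewpoint)
        if j is not None:
            return path_a[: i + 1] + path_b[j + 1:]
    return None


if __name__ == "__main__":
    # Made-up viewpoint IDs, for illustration only.
    print(mix_paths(["a", "b", "c", "d"], ["x", "c", "y", "z"]))
```

In a setup like this, each mixed path would still need a matching instruction (for example, by joining the sub-instructions of the two source segments); how the paper handles that step is not described in this summary.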

Abstract

Embodied AI aims to develop robots that can understand and execute human language instructions, as well as communicate in natural languages. On this front, we study the task of generating highly detailed navigational instructions for the embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in terms of their referral to objects and landmarks. Existing speaker models learn strategies to evade the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator or Speaker model that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an adversarial setting to avoid systematic bias introduced by language evaluation metrics. Empirically, our method outperforms existing instruction generation models, evaluated using standard metrics. Our code is available at https://github.com/gmuraleekrishna/SAS.

Paper Structure

This paper contains 51 sections, 14 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: Extracting 3D scene relationships from house environments (a,b) can improve instruction generation by including object references (c).
  • Figure 2: Path Mixing (PM) using fine-grained paths from the R2R dataset. Two original paths are mixed to generate a new path.
  • Figure 3: Adversarial training of the SAS model. SAS learns to generate instructions, while the reward model learns the reward function from ground-truth data. The learned reward function is then used to optimise the policy.
  • Figure 4: An example of a trajectory and the corresponding instruction generated by the SAS$_{ARL+TF}$ model.
  • Figure 5: Unique instruction words present in the R2R dataset splits. (a) Words common to the splits; (b) the ratio of the number of differing words to the number of common words between the splits.
  • ...and 3 more figures