SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments

Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar

TL;DR

The paper tackles Vision-and-Language Navigation in Continuous Environments (VLN-CE), where agents must follow natural language instructions in unseen 3D spaces. It introduces SASRA, a hybrid transformer-recurrence agent that builds a temporal semantic memory via local ego-centric maps and aligns semantic maps with language through cross-modal transformers (SLAM-T and RGBD-Linguistic) and a Hybrid Action Decoder, trained with Teacher-Forcing and DAGGER. Key contributions include the first end-to-end integration of semantic mapping with language for VLN-CE, novel cross-modal attention mechanisms, and comprehensive ablations showing substantial gains over state-of-the-art baselines. Empirical results on the Habitat VLN-CE benchmark demonstrate significant improvements in SPL and SR, particularly with DAGGER, and qualitative analyses illustrate robust long-horizon navigation in unseen environments.

Abstract

This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task because they focus mostly on raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model which combines classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align the map and language modalities, enabling effective learning of a VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines, with over 22% relative improvement in SPL in previously unseen environments.
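
The headline result above is stated in SPL (Success weighted by Path Length). The text here does not define the metric, so the following is only a minimal sketch, assuming the commonly used definition: the mean over episodes of success times shortest-path length divided by the larger of the agent's path length and the shortest-path length. The function names `spl` and `relative_improvement` are hypothetical, and every number in the usage lines is made up for illustration, not taken from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length (assumed standard definition).

    episodes: iterable of (success, shortest_path_length, agent_path_length).
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        denom = max(taken, shortest)
        # Failed episodes contribute 0; successful ones are discounted
        # by how much longer the agent's path was than the shortest path.
        if success and denom > 0:
            total += shortest / denom
    return total / len(episodes)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.22 means 22%."""
    return (new - baseline) / baseline


# Illustrative, made-up episodes: an optimal success, a success with a
# path twice the shortest length, and a failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_spl = spl(toy_episodes)

# Hypothetical SPL values, chosen only to show what a 22% relative gain means.
toy_gain = relative_improvement(0.61, 0.50)
```

A relative improvement is measured against the baseline's own score, so a 22% relative gain in SPL is not the same as a 22-point absolute gain.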

Paper Structure

This paper contains 12 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview: Vision and Language Navigation task and our proposed SASRA agent. Our main novelty lies in employing a hybrid transformer-recurrence model for VLN by utilising a cross-modal Semantic-Linguistic Attention Map Transformer (SLAM-T). At each time-step $t$, the agent generates a local semantic map from visual observations. The agent consistently reasons about the environment in both the spatial (cross-modal attention between semantic map and language) and temporal (preserving previous state information through time) domains to decode the low-level actions ($a_{t}$) at each time-step.
  • Figure 2: Model Architecture (Detailed model): Our approach utilises learning-based cross-modal attention modules. The Semantic-Linguistic Attention Map Transformer (SLAM-T) and the RGBD-Linguistic Transformer consistently reason between visual and textual spatial domains. The Hybrid Action Decoder captures the temporal dependencies inherent in following a trajectory over time.
  • Figure 3: Model Architecture (Individual Blocks): Our model utilises fully-attentive Transformer blocks at each stage. a) We encode language using a self-attention Transformer. b) Visual-Linguistic attention is performed using two-stage Transformer blocks employing both self- and cross-attention. c) The action decoder comprises a single cross-modal Transformer module.
  • Figure 4: Qualitative Analysis: The figure shows instruction-following trajectories of our SASRA agent in unseen environments within VLN-CE. Sample observations (clockwise: RGB, Semantics, Depth and Semantic Map) seen by the agent and the corresponding actions (overlaid on RGB) are shown at each timestep. Note that the top-down map (shown on the right) is not available to the agent and is only used for performance evaluation.
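
The captions above describe cross-modal attention between the semantic map and the language instruction. As a rough illustration of that idea only, and not the authors' implementation, the sketch below applies plain scaled dot-product attention in which map-cell features act as queries and word features act as keys and values. The attention direction, the function names, and the toy feature vectors are all assumptions. Learned projections, multiple heads, and the self-attention stages mentioned in Figure 3 are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of map-cell feature vectors (assumed role)
    keys, values: lists of word feature vectors (assumed role)
    Returns (attended outputs, attention weights per query).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy, made-up features: two map cells and three instruction words.
map_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn = cross_attention(map_feats, word_feats, word_feats)
```

In this toy setup each map cell places most of its attention weight on the word whose feature vector points in the same direction, which is the kind of map-to-language alignment that cross-modal grounding aims for.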