PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation

Liuyi Wang; Chengju Liu; Zongtao He; Shu Li; Qingqing Yan; Huiyi Chen; Qijun Chen

TL;DR

Vision-and-Language Navigation (VLN) requires embodied agents to follow natural-language instructions in real environments, but limited training data constrains the quality of the speaker models used for data augmentation. PASTS is a progress-aware spatio-temporal transformer speaker that combines a spatio-temporal encoder, a Speaker Progress Monitor (SPM), and Multifeature Dropout (MFD) to generate more realistic and better-aligned instructions. It uses a back-translation data augmentation pipeline with the PREVALENT dataset and FGR2R supervision, and achieves state-of-the-art performance on the R2R VLN benchmark both as a speaker and, when integrated with existing followers, for the follower. The approach improves generalization to unseen environments and can be combined with a range of VLN followers, which supports its use in scalable VLN systems.

Abstract

Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique for improving generalization in VLN is to use an independent speaker model that provides pseudo instructions for data augmentation. However, current speaker models based on long short-term memory (LSTM) cannot attend to the features that are relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS uses a spatio-temporal encoder to fuse panoramic representations and encode the connections between intermediate steps. In addition, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed that enables the model to estimate the progress of instruction generation and produce more fine-grained captions. A multifeature dropout (MFD) strategy is also introduced to alleviate overfitting. The proposed PASTS can be flexibly combined with existing VLN models. The experimental results demonstrate that PASTS outperforms all existing speaker models and improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.

Paper Structure (28 sections, 10 equations, 11 figures, 5 tables)

Introduction
Related Work
  Vision-and-Language Navigation
  Transformers for Visual Captioning
  Auxiliary Tasks
Method
  Spatio-Temporal Transformer Encoder
  Speaker Progress Monitor (SPM)
  Multifeature Dropout (MFD)
  Training Procedures
Experiments
  Dataset
  Metrics
  Implementation Details
  Main Results for Different Speaker Models
...and 13 more sections

Figures (11)

Figure 1: Illustration of the follower-speaker system in VLN, where the follower aims to predict the action based on instructions, and the speaker aims to generate instructions based on trajectories. In this paper, the proposed PASTS speaker has the ability to recognize different stages for navigation (shown in different colors) and generate more accurate and fine-grained instructions.
Figure 2: Overall architecture of the PASTS framework, which consists of three sub-modules: (a) the spatio-temporal encoder integrates the dominant features from the environment at each location and encodes successive action features along the time dimension; (b) the generation decoder converts the fused visual information and the shifted words into a sequence of target probabilities; (c) the word prediction head and the progress prediction head predict instruction words and progress values, respectively.
Figure 3: Illustration of the spatial encoder (top) and the temporal encoder (bottom). The spatial encoder is used to effectively fuse the action embedding and the environment embedding in each step, and the temporal encoder is applied to capture the internal connection between different navigation steps.
Figure 4: Illustration of the speaker progress monitor (SPM). The complete trajectories and instructions are divided into several subsets. Each word is assigned the corresponding progress value (shown in brown) to align the instructions and trajectories. Different colors are used to represent different navigation stages.
Figure 5: Illustration of the multifeature dropout (MFD). In order to avoid serious overfitting, five dropouts at different positions are proposed to increase the diversity of the network structure. (a) represents the input representation, (b) indicates the attention mechanism, (c) represents the feed-forward network, and (d) is the output projection.
...and 6 more figures
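
The word-level progress labels described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that each sub-instruction has already been paired with a sub-trajectory, and that the progress value of a chunk is the index of the last navigation step it covers divided by the total number of steps. That formula, the function name word_progress_labels, and the toy data are illustrative assumptions, not the paper's definition.

```python
from typing import List, Tuple


def word_progress_labels(
    sub_instructions: List[List[str]],
    sub_path_ends: List[int],
    num_steps: int,
) -> List[Tuple[str, float]]:
    """Give every word the progress value of the sub-trajectory it describes.

    Assumption (hypothetical, not taken from the paper): the progress of a
    chunk is (last step index covered by its sub-trajectory) / num_steps.
    """
    if len(sub_instructions) != len(sub_path_ends):
        raise ValueError("each sub-instruction needs one sub-trajectory end")
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")

    labels: List[Tuple[str, float]] = []
    for words, end_step in zip(sub_instructions, sub_path_ends):
        progress = end_step / num_steps
        # All words in the same sub-instruction share one progress value.
        labels.extend((word, progress) for word in words)
    return labels


# Toy example: a 6-step trajectory split into three stages.
sub_instructions = [
    ["walk", "past", "the", "sofa"],
    ["turn", "left"],
    ["stop", "at", "the", "door"],
]
labels = word_progress_labels(sub_instructions, sub_path_ends=[2, 3, 6], num_steps=6)
for word, progress in labels:
    print(f"{word:>5s}  {progress:.2f}")
```

Under these assumptions, a progress prediction head could be trained to regress the per-word values alongside the usual word prediction loss, so that words from later stages of the instruction are tied to later parts of the trajectory.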
