Vision-and-Language Navigation Generative Pretrained Transformer

Wen Hanlin

TL;DR

This proposal uses a transformer decoder model (GPT-2) to model dependencies within the trajectory sequence, removing the need for a separate history-encoding module. It also splits training into two stages: offline pre-training with imitation learning and online fine-tuning with reinforcement learning.

Abstract

In Vision-and-Language Navigation (VLN), agents must navigate real-world scenes by following natural-language instructions. Keeping the agent faithful to the instruction throughout navigation is a significant challenge. Common approaches address it with encoders that explicitly record past locations and actions, which increases model complexity and resource consumption. Our proposal, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), instead adopts a transformer decoder model (GPT-2) to model dependencies within the trajectory sequence, bypassing the need for historical encoding modules. Because the decoder attends directly to earlier tokens of the trajectory sequence, historical information is available without a separate module, which improves efficiency. Furthermore, our model separates training into offline pre-training with imitation learning and online fine-tuning with reinforcement learning. This separation gives each stage a more focused training objective and improves performance. Performance assessments on the VLN dataset show that VLN-GPT surpasses complex state-of-the-art encoder-based models.
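
To make the idea of reading history directly from the trajectory sequence concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Decision-Transformer-style layout in which the instruction tokens are followed by (return, observation, action) triples per time step; the exact token order, the helper names, and the string placeholders for observations and actions are all illustrative assumptions. The causal mask shows which tokens a decoder may attend to when it predicts the action at an observation token.

```python
from typing import Any, List, Tuple

Token = Tuple[str, Any]  # (kind, payload); the kind labels are hypothetical


def build_trajectory_sequence(
    instruction_tokens: List[str],
    steps: List[Tuple[float, Any, Any]],
) -> List[Token]:
    """Flatten one episode into a single sequence.

    Each step is (return, observation, action). The ordering within a
    step is an assumption made for this sketch.
    """
    seq: List[Token] = [("instr", tok) for tok in instruction_tokens]
    for ret, obs, act in steps:
        seq.append(("return", ret))
        seq.append(("obs", obs))
        seq.append(("action", act))
    return seq


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def action_prediction_positions(seq: List[Token]) -> List[int]:
    """Indices of observation tokens, where the next action is predicted."""
    return [i for i, (kind, _) in enumerate(seq) if kind == "obs"]


# Hypothetical two-step episode with placeholder observations and actions.
seq = build_trajectory_sequence(
    ["walk", "to", "kitchen"],
    [(1.0, "obs_0", "forward"), (0.5, "obs_1", "stop")],
)
mask = causal_mask(len(seq))
for pos in action_prediction_positions(seq):
    visible = [seq[j] for j in range(len(seq)) if mask[pos][j]]
    # Everything up to and including the current observation is visible,
    # so past observations and actions need no separate history encoder.
    print(pos, [kind for kind, _ in visible])
```

In this layout the action taken at a given step is never visible when that same action is being predicted, while all earlier returns, observations, and actions are.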

Paper Structure (16 sections, 15 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: An example from the R2R validation dataset. More qualitative results are given in the Appendix.
  • Figure 2: The architecture of the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT). VLN-GPT adopts a transformer decoder to model the dependencies among the instruction, returns, observations, and actions in the trajectory sequence, and predicts the action at the observation token at each time step $t$.
  • Figure 3: Success Rate (SR) and Success weighted by Path Length (SPL) on the R2R validation dataset for GPT models of varying parameter scale, with SR and SPL shown in separate subfigures. The parameter-scale variants are created by adjusting the number of transformer blocks. Due to computational constraints, experiments with more than 20 transformer blocks were infeasible.
  • Figure 4: Examples from the R2R validation dataset. The sentence at the top is the instruction for the example. The background image is an overhead view of the navigation environment. The green arrows denote the trajectory of our VLN-GPT agent, and the blue arrows denote the trajectory of PREVALENT, the baseline model with a transformer encoder architecture. The blue point marks the start of the trajectory, and the orange point marks the endpoint. The label $\checkmark$ means the agent successfully reaches the intended target, and the label $\times$ means the agent fails to navigate to the target.
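
The two metrics named in Figure 3 can be sketched as follows. This is a minimal illustration using the commonly used definitions, not code from the paper: an episode counts as a success when the agent stops within a distance threshold of the target, and SPL weights each success by the ratio of the shortest-path length to the longer of the shortest-path length and the path actually taken. The `Episode` fields, the function names, and the 3.0 m default threshold are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    final_distance: float   # distance from the stopping point to the target (m)
    shortest_length: float  # length of the shortest path from start to target (m)
    path_length: float      # length of the path the agent actually took (m)


def is_success(ep: Episode, threshold: float = 3.0) -> bool:
    # The threshold value is an assumption; adjust it to the benchmark in use.
    return ep.final_distance <= threshold


def success_rate(episodes: List[Episode], threshold: float = 3.0) -> float:
    """Fraction of episodes that end within the threshold of the target."""
    return sum(is_success(ep, threshold) for ep in episodes) / len(episodes)


def spl(episodes: List[Episode], threshold: float = 3.0) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for ep in episodes:
        if is_success(ep, threshold):
            # Failed episodes contribute 0; successful ones are discounted
            # when the agent's path is longer than the shortest path.
            total += ep.shortest_length / max(ep.path_length, ep.shortest_length)
    return total / len(episodes)


# Made-up episodes for illustration only.
episodes = [
    Episode(final_distance=1.0, shortest_length=10.0, path_length=12.5),
    Episode(final_distance=5.0, shortest_length=8.0, path_length=8.0),
    Episode(final_distance=2.0, shortest_length=6.0, path_length=6.0),
]
print(success_rate(episodes), spl(episodes))
```

Because each success is scaled by a factor of at most 1, SPL can never exceed SR, which is why the two are usually reported together.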