A Recurrent Vision-and-Language BERT for Navigation

Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould

TL;DR

This work introduces VLN BERT, a recurrent, time-aware Vision-and-Language BERT designed for vision-and-language navigation under partial observability. By keeping language tokens fixed after initialization and updating a history-aware state through the Transformer, the model achieves state-of-the-art results on R2R and REVERIE while maintaining memory efficiency. The approach supports pre-training and multi-task capabilities, enabling navigation and referring expression tasks with a single architecture. Experimental results demonstrate strong generalization to unseen environments and efficient learning, highlighting the practical impact of integrating recurrence into V&L BERT for navigation and beyond.

Abstract

Accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application for the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of solving navigation and referring expression tasks simultaneously.
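
The recurrence summarised above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a single randomly initialised attention layer stands in for the multi-layer Transformer, the state refinement step is omitted, and reading the decision probabilities off the state's attention over the visual tokens is an assumption made for this toy. All names (`attend`, `initialise`, `step`, `navigate`) and the hidden size `D` are hypothetical. What it does show is the control flow: the instruction is encoded once, the first token's output (standing in for [CLS]) becomes the initial state, and at every step only the state and the visual tokens are updated while the language features stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size (hypothetical toy value)

# Toy stand-in for the shared Transformer: one single-head attention layer.
W_q, W_k, W_v = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    """Each query row attends over all context rows."""
    q, k, v = queries @ W_q, context @ W_k, context @ W_v
    weights = softmax(q @ k.T / np.sqrt(D))
    return weights @ v, weights


def initialise(instruction_tokens):
    """Encode the whole instruction once; token 0 plays the role of [CLS]."""
    encoded, _ = attend(instruction_tokens, instruction_tokens)
    return encoded[0], encoded[1:]  # initial state, fixed language features


def step(state, language, visual):
    """One navigation step: language is context only and is never updated."""
    context = np.vstack([state[None, :], language, visual])
    queries = np.vstack([state[None, :], visual])
    out, weights = attend(queries, context)
    new_state = out[0]
    # Assumption for this sketch: action scores are the state's attention
    # weights over the visual tokens, renormalised to sum to one.
    n_vis = visual.shape[0]
    action_probs = weights[0, -n_vis:]
    action_probs = action_probs / action_probs.sum()
    return new_state, action_probs


def navigate(instruction_tokens, observations):
    """Carry the state through time; pick the most probable direction."""
    state, language = initialise(instruction_tokens)
    actions = []
    for visual in observations:
        state, probs = step(state, language, visual)
        actions.append(int(probs.argmax()))
    return state, actions


# Usage: a 6-token instruction and 3 time steps with 4 candidate views each.
instruction = rng.normal(size=(6, D))
observations = [rng.normal(size=(4, D)) for _ in range(3)]
final_state, chosen = navigate(instruction, observations)
```

Because the language features are computed once and reused as context, the per-step cost in this sketch depends on the number of visual candidates rather than on re-encoding the instruction, which is the memory-saving idea the TL;DR alludes to.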

Paper Structure

This paper contains 48 sections, 22 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Recurrent multi-layer Transformer for addressing partially observable inputs. A state token is defined along with the input sequence. At each time step, a new state representation $\boldsymbol{s}_{t}$ is generated based on the new observation. Meanwhile, the past information helps infer a new decision $d_{t}$.
  • Figure 2: Schematics of the Recurrent Vision-and-Language BERT. At the initialisation stage, the entire instruction is encoded by a multi-layer Transformer, where the output feature of the [CLS] token serves as the initial state representation of the agent. During navigation, the concatenated sequence of state, encoded language and new visual observation is fed to the same Transformer to obtain the updated state and decision probabilities. The updated state and the language encoding from initialisation will be fused and applied as input at the next time step. The green star (★) indicates the cross-modal matching (Eq. \ref{eqn:matching}) and the past decision encoding (Eq. \ref{eqn:stateaction}) in State Refinement.
  • Figure 3: Averaged attention weights over all instructions in validation unseen split during navigation. State: Attention weights with respect to the state representation. Selected Action: Attention weights with respect to the visual token at the selected direction.
  • Figure 4: Comparison of the learning curves. no init. means randomly initialised network parameters.
  • Figure 5: Adaptation to recurrent PREVALENT. At initialisation, the entire instruction is encoded by a language transformer (TRM-Lang1), where the output feature of the [CLS] token serves as the initial state representation of the agent. During navigation, the concatenated sequence of state, encoded language and new visual observation is fed to the cross-modality and the single-modality encoders to obtain the updated state and decision probabilities. The updated state and the language encoding from initialisation will be fused and applied as input at the next time step. The green star (★) indicates the cross-modal matching and the past decision encoding (Section 3.3).
  • ...and 3 more figures