Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation

Liuyi Wang, Zongtao He, Ronghao Dang, Huiyi Chen, Chengju Liu, Qijun Chen

TL;DR

This paper uses a structured causal model (SCM) to establish reasonable assumptions about the visual and linguistic confounders in VLN, and proposes an iterative backdoor-based representation learning (IBRL) method that enables adaptive and effective intervention on those confounders.

Abstract

Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios. However, existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments. In this paper, we tackle this challenge by proposing a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations. Specifically, we establish reasonable assumptions about confounders for vision and language in VLN using the structured causal model (SCM). Building upon this, we propose an iterative backdoor-based representation learning (IBRL) method that allows for the adaptive and effective intervention on confounders. Furthermore, we introduce the visual and linguistic backdoor causal encoders to enable unbiased feature expression for multi-modalities during training and validation, enhancing the agent's capability to generalize across different environments. Experiments on three VLN datasets (R2R, RxR, and REVERIE) showcase the superiority of our proposed method over previous state-of-the-art approaches. Moreover, detailed visualization analysis demonstrates the effectiveness of CausalVLN in significantly narrowing down the performance gap between seen and unseen environments, underscoring its strong generalization capability.

Paper Structure

This paper contains 38 sections, 19 equations, 11 figures, 6 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Vision-and-Language Navigation (VLN) involves an agent navigating through visual environments based on language instructions. In this paper, we leverage intervention on both language and vision modalities to learn unbiased features and enhance the generalization of the model.
  • Figure 2: The overview of the proposed CausalVLN model. The visual backdoor causal encoder (a) and linguistic backdoor causal encoder (b) are used to learn the causality-based features for vision and language, respectively. The iterative update strategy for backdoor-based representation learning is employed in the language branch due to the participation of BERT in end-to-end training. The memory-augmented global-local cross-modal fusion (c) and dynamic action prediction (d) are used to enhance long-term navigation and adaptive decision-making.
  • Figure 3: Illustration of the proposed causal graph. $V$, $T$, and $A$ denote the visual inputs, language inputs, and action prediction, respectively. $Z_V$ and $Z_T$ denote the confounders of vision and language, respectively. $F_V$, $F_T$, and $F_X$ are the hidden representations.
  • Figure 4: The statistics of $P(Y|X)$ and $P(Y|do(X))$. Only a subset of the object pairs is visualized to avoid clutter. With the help of the intervention, the estimated relationships for some pairs are corrected and agree better with common sense.
  • Figure 5: Illustration of the iterative backdoor-based representation learning.
  • ...and 6 more figures
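
The contrast between $P(Y|X)$ and $P(Y|do(X))$ mentioned in the Figure 4 caption can be illustrated with the standard backdoor adjustment, $P(Y|do(X)) = \sum_z P(Y|X,z)\,P(z)$. The sketch below is a toy example only: the binary variables, the probability tables, and the function names are all hypothetical and are not taken from the paper, whose confounder definitions and encoders are not shown in this summary. It assumes a single discrete confounder $Z$ that influences both $X$ and $Y$, while $X$ itself has no effect on $Y$.

```python
# Toy backdoor adjustment. Every number here is made up for illustration;
# none of these values or names come from the CausalVLN paper.

# Prior over a binary confounder Z (think: two hypothetical scene types).
P_Z = {0: 0.5, 1: 0.5}

# P(X=1 | Z=z): X is strongly tied to the confounder.
P_X1_GIVEN_Z = {0: 0.9, 1: 0.1}

# P(Y=1 | X=x, Z=z), keyed by (x, z): Y depends on Z only, not on X.
P_Y1_GIVEN_XZ = {(0, 0): 0.8, (1, 0): 0.8, (0, 1): 0.2, (1, 1): 0.2}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | x, z) * P(z | x)."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(
        P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] / p_x
        for z in P_Z
    )


def interventional(x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)  (backdoor adjustment)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    for x in (0, 1):
        print(f"X={x}: P(Y=1|X)={observational(x):.2f}  "
              f"P(Y=1|do(X))={interventional(x):.2f}")
```

In this toy setup, conditioning on $X$ makes $Y$ look strongly dependent on $X$ because $X$ carries information about $Z$. Under the adjustment, $Z$ is weighted by its prior rather than by $P(z|x)$, and the apparent dependence disappears. This is the kind of spurious association that an intervention is meant to remove.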