Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

Bingqian Lin, Yi Zhu, Xiaodan Liang, Liang Lin, Jianzhuang Liu

TL;DR

Vision-Language Navigation (VLN) requires aligning complex visual observations with natural language instructions, a problem made harder by the semantic gap between modalities. The authors propose Actional Atomic-Concept Learning (AACL), which maps each observation to an actional atomic concept: a short phrase combining an atomic action and an object. Object concepts are predicted with CLIP, a concept refining adapter re-ranks them toward the instruction, and an observation co-embedding module with a contrastive objective uses the concept representations to regularize the observation representations. This three-component framework achieves state-of-the-art results on the R2R, REVERIE, and R2R-Last benchmarks and makes action decisions more interpretable. Overall, AACL improves cross-modal alignment in VLN.

Abstract

Vision-Language Navigation (VLN) is a challenging task that requires an agent to align complex visual observations with language instructions in order to reach the goal position. Most existing VLN agents directly learn to align raw directional features, and visual features trained with one-hot labels, with linguistic instruction features. However, the large semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts to facilitate the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., "go up stairs". These actional atomic concepts serve as a bridge between observations and instructions, and can effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module that maps the observations to actional atomic concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter that encourages more instruction-oriented object concept extraction by re-ranking the object concepts predicted by CLIP, and 3) an observation co-embedding module that uses the concept representations to regularize the observation representations. AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, the visualization shows that AACL significantly improves the interpretability of action decisions.
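The concept mapping step described above can be illustrated with a small sketch. It is not the authors' implementation: the action list, the angle thresholds, and the helper names (`atomic_action`, `top_k_objects`, `actional_concepts`) are assumptions made for illustration, and the toy vectors stand in for CLIP image and text embeddings. The concept refining adapter, which re-ranks objects toward the instruction, is not modeled here.

```python
import numpy as np


def atomic_action(heading_deg: float, elevation_deg: float) -> str:
    """Map a relative direction to an atomic action phrase.

    The thresholds (15 and 30/150 degrees) are assumed for illustration only.
    """
    if elevation_deg > 15:
        return "go up"
    if elevation_deg < -15:
        return "go down"
    h = (heading_deg + 180.0) % 360.0 - 180.0  # wrap heading to [-180, 180)
    if abs(h) <= 30:
        return "go forward"
    if abs(h) >= 150:
        return "go back"
    return "turn right" if h > 0 else "turn left"


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k_objects(image_feat, object_feats, object_names, k=2):
    """Rank object names by cosine similarity to the image embedding."""
    sims = l2_normalize(object_feats) @ l2_normalize(image_feat)
    order = np.argsort(-sims)[:k]
    return [object_names[i] for i in order]


def actional_concepts(heading_deg, elevation_deg, image_feat,
                      object_feats, object_names, k=2):
    """Combine one atomic action with the top-k objects into phrases."""
    action = atomic_action(heading_deg, elevation_deg)
    objects = top_k_objects(image_feat, object_feats, object_names, k)
    return [f"{action} {name}" for name in objects]


# Toy stand-ins for CLIP text embeddings of an object vocabulary
# and for the CLIP image embedding of one single-view image.
names = ["stairs", "sofa", "door"]
object_feats = np.eye(3)
image_feat = np.array([0.9, 0.1, 0.3])

concepts = actional_concepts(0.0, 30.0, image_feat, object_feats, names)
print(concepts)
```

In a real pipeline the object and image embeddings would come from a CLIP text encoder and image encoder; the cosine-similarity ranking shown here is the standard way CLIP scores image-text pairs.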

Paper Structure (21 sections, 16 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Comparison between existing VLN agents and the proposed AACL. By mapping the visual observations to actional atomic concepts, AACL simplifies the multi-modal alignment and distinguishes different observation candidates more easily, leading to accurate action decisions.
  • Figure 2: Overview of our Actional Atomic-Concept Learning (AACL). At each timestep $t$, the agent receives the instruction $I$, the observation $O_{t}$, and the navigation history $H_{t}$. For each $O_{t,n}$ in $O_{t}$ containing the single-view image $B_{t,n}$ and the direction $A_{t,n}$, object concept mapping and action concept mapping are conducted based on the concept refining adapter to obtain the actional atomic concept representations $\mathbf{\tilde{u}}_{t,n}$. Then $\mathbf{\tilde{u}}_{t,n}$ is used to regularize the visual representation $\mathbf{v}_{t,n}$ and the directional representation $\mathbf{e}_{A_{t,n}}$ through the observation co-embedding module for making action selection. For simplicity, we omit the learning process of navigation histories $H_{t}$, which is similar to that of observations $O_{t}$.
  • Figure 3: Visualization examples of action selection (panels (a) and (b)) and object concept mapping (panel (c)). In (a) and (b), the baseline is HAMT (Chen et al., 2021). The green boxes denote the correct actions and the red boxes denote the wrong ones.
  • Figure 4: Visualization of the action selections made by the baseline method HAMT (Chen et al., 2021) and by our AACL. The green boxes denote the correct actions and the red boxes denote the wrong ones. After step 6, marked with the grey dashed box, the baseline and AACL follow different trajectories.
  • Figure 5: Failure case of AACL. The green boxes denote the correct actions and the red boxes denote the wrong ones. After step 1, marked with the grey dashed box, the AACL trajectory diverges from the ground truth.
  • ...and 2 more figures
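
The Figure 2 caption says the concept representation of each candidate is used to regularize its visual and directional representations, and the TL;DR mentions a contrastive objective for this co-embedding. The sketch below shows one common form such an objective can take, an InfoNCE-style loss that pulls each observation embedding toward its own concept embedding and away from the concepts of other candidates. The function name `info_nce`, the temperature value, and the use of cosine similarity are assumptions; the paper's exact loss may differ.

```python
import numpy as np


def info_nce(obs: np.ndarray, concepts: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE-style loss over N paired (observation, concept) embeddings.

    Row i of `obs` is treated as the positive match for row i of `concepts`;
    all other rows act as negatives. Both inputs have shape (N, d).
    """
    obs = obs / np.linalg.norm(obs, axis=1, keepdims=True)
    concepts = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    logits = obs @ concepts.T / temperature              # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy check: aligned pairs should give a much lower loss than mismatched pairs.
obs = np.eye(3)
aligned = info_nce(obs, np.eye(3))
mismatched = info_nce(obs, np.roll(np.eye(3), 1, axis=0))
print(aligned < mismatched)
```

Minimizing a loss of this kind moves each observation embedding closer to the embedding of its own concept phrase, which is one way to read "regularize" in the caption.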