Continual Vision-and-Language Navigation

Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang

TL;DR

This work addresses continual learning for Vision-and-Language Navigation (VLN) by introducing Continual VLN (CVLN), where agents learn across sequential scene domains and are evaluated on all encountered domains. It defines two instruction modalities, Initial-instruction CVLN (I-CVLN) and Dialogue-based CVLN (D-CVLN), and proposes two rehearsal-based baselines, Perplexity Replay (PerpR) and Episodic Self-Replay (ESR), to mitigate catastrophic forgetting. Empirical results show that standard continual learning (CL) methods underperform on CVLN, while PerpR and ESR use replay memory to retain past knowledge while adapting to new domains, with ESR performing best in I-CVLN and PerpR in D-CVLN. The findings highlight the importance of how the replay memory is constructed and of preserving past sequential decisions for robust navigation in changing environments, and point to future work on continuous environments and CVLN-specific datasets.

Abstract

Developing Vision-and-Language Navigation (VLN) agents typically assumes a train-once-deploy-once strategy, which is unrealistic as deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple scene domains. CVLN includes two setups: Initial-instruction based CVLN for instruction-following, and Dialogue-based CVLN for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
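
The idea of replaying difficult episodes can be illustrated with a minimal Python sketch. It is not the paper's implementation: this summary does not define Action Perplexity, so the sketch assumes it is the exponential of the mean negative log-probability the agent assigns to the reference action at each step. The names `action_perplexity` and `PerplexityReplayMemory`, and the fixed-capacity keep-the-hardest policy, are hypothetical.

```python
import heapq
import math


def action_perplexity(step_probs):
    """Assumed definition: exp of the mean negative log-probability that the
    agent assigned to the reference action at each step of one episode."""
    nll = -sum(math.log(p) for p in step_probs) / len(step_probs)
    return math.exp(nll)


class PerplexityReplayMemory:
    """Hypothetical fixed-size memory that keeps the highest-perplexity episodes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (perplexity, insertion_id, episode)
        self._count = 0   # unique tie-breaker so episodes are never compared

    def add(self, episode, step_probs):
        ap = action_perplexity(step_probs)
        item = (ap, self._count, episode)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif ap > self._heap[0][0]:
            # Evict the easiest stored episode in favour of a harder one.
            heapq.heapreplace(self._heap, item)
        return ap

    def episodes(self):
        """Stored episodes, hardest first."""
        return [ep for _, _, ep in sorted(self._heap, reverse=True)]
```

A min-heap keeps eviction cheap: the easiest stored episode is always at the root, so a new episode only needs to be compared against that one entry.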

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparison between (a) VLN and (b) CVLN: VLN trains agents on fixed environments and evaluates them on unseen ones, while CVLN trains agents sequentially on new environments and evaluates them on both previously encountered and newly learned environments.
  • Figure 2: Comparison between I-CVLN and D-CVLN. In I-CVLN, the agent is given an initial instruction containing all the information about the navigation path. Conversely, in D-CVLN, the agent obtains information about the navigation path through communication with an oracle.
  • Figure 3: Overview of Perplexity Replay (PerpR) and Episodic Self-Replay (ESR) for CVLN: (a) PerpR prioritizes challenging episodes with high Action Perplexity (AP), and (b) ESR enables self-replay using past behaviors.
  • Figure 4: Stability-plasticity trade-off comparison in I-CVLN and D-CVLN: We calculate stability and plasticity for agents after learning 10 scene domains.
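
The abstract and Figure 3(b) describe ESR as storing action logits and revisiting them during training. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' method: the exact replay loss is not given here, so the sketch assumes a per-step KL divergence between the stored and current action distributions. `SelfReplayMemory`, `remember`, and `replay_loss` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class SelfReplayMemory:
    """Hypothetical store of the agent's own past action logits per episode."""

    def __init__(self):
        self._store = {}

    def remember(self, episode_id, step_logits):
        # Copy the logits so later changes by the caller do not alter the memory.
        self._store[episode_id] = [list(step) for step in step_logits]

    def replay_loss(self, episode_id, current_step_logits):
        """Assumed loss: mean per-step KL between stored and current behavior."""
        stored = self._store[episode_id]
        losses = [
            kl_divergence(softmax(old), softmax(new))
            for old, new in zip(stored, current_step_logits)
        ]
        return sum(losses) / len(losses)
```

Under this assumption the loss is zero when the agent still behaves exactly as it did when the episode was stored, and it grows as the current action distribution drifts away from the stored one.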