Continual Vision-and-Language Navigation
Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang
TL;DR
This work addresses continual learning for Vision-and-Language Navigation (VLN) by introducing Continual VLN (CVLN), where agents learn across sequential scene domains and are evaluated on all domains encountered so far. It defines two settings that differ in the form of the instruction, Initial-instruction CVLN (I-CVLN) and Dialogue-based CVLN (D-CVLN), and proposes two rehearsal-based baselines, Perplexity Replay (PerpR) and Episodic Self-Replay (ESR), to mitigate catastrophic forgetting. Empirical results show that standard continual learning (CL) methods underperform on CVLN, whereas PerpR and ESR use replay memory to retain past knowledge while adapting to new domains, with ESR excelling in I-CVLN and PerpR in D-CVLN. The findings highlight the importance of memory construction and sequential decision fidelity for robust navigation in changing environments, and point to future work on continuous environments and CVLN-specific datasets.
Abstract
Developing Vision-and-Language Navigation (VLN) agents typically assumes a train-once-deploy-once strategy, which is unrealistic because deployed agents continually encounter novel environments. To address this, we propose the Continual Vision-and-Language Navigation (CVLN) paradigm, where agents learn and adapt incrementally across multiple scene domains. CVLN includes two setups: Initial-instruction-based CVLN (I-CVLN) for instruction following, and Dialogue-based CVLN (D-CVLN) for dialogue-guided navigation. We also introduce two simple yet effective baselines for sequential decision-making: Perplexity Replay (PerpR), which replays difficult episodes, and Episodic Self-Replay (ESR), which stores and revisits action logits during training. Experiments show that existing continual learning methods fall short for CVLN, while PerpR and ESR achieve better performance by efficiently utilizing replay memory.
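The idea of replaying difficult episodes can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that difficulty is measured as the perplexity of the probabilities the agent assigns to the ground-truth action at each step, as the name Perplexity Replay suggests, and that the replay memory keeps a fixed number of the hardest episodes. The names `episode_perplexity` and `PerplexityMemory` are hypothetical.

```python
import heapq
import math
from typing import List, Tuple


def episode_perplexity(action_probs: List[float]) -> float:
    """Perplexity of one episode (assumed difficulty measure).

    action_probs[t] is the probability the agent assigned to the
    ground-truth action at step t. Perplexity is the exponential of
    the mean negative log-probability over the episode.
    """
    if not action_probs:
        raise ValueError("episode has no steps")
    mean_nll = -sum(math.log(p) for p in action_probs) / len(action_probs)
    return math.exp(mean_nll)


class PerplexityMemory:
    """Fixed-size replay memory keeping the highest-perplexity episodes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Min-heap on perplexity, so the easiest stored episode is at index 0.
        self._heap: List[Tuple[float, str]] = []

    def add(self, episode_id: str, action_probs: List[float]) -> None:
        item = (episode_perplexity(action_probs), episode_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # New episode is harder than the easiest stored one: swap it in.
            heapq.heapreplace(self._heap, item)

    def episode_ids(self) -> List[str]:
        """Stored episode ids, hardest first."""
        return [eid for _, eid in sorted(self._heap, reverse=True)]


memory = PerplexityMemory(capacity=2)
memory.add("easy", [0.9, 0.9])
memory.add("medium", [0.5, 0.5])
memory.add("hard", [0.1, 0.2])
print(memory.episode_ids())
```

With a capacity of two, the confidently solved episode is evicted and the two harder ones remain for replay when the agent trains on a later scene domain. ESR would instead store the agent's own action logits for each kept episode; that variant is not sketched here.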
