Vision-Language Navigation with Continual Learning

Zhiyuan Li; Yanfeng Lv; Ziqin Tu; Di Shang; Hong Qiao

TL;DR

This work addresses the limited generalization of Vision-Language Navigation agents to unseen environments by introducing Vision-Language Navigation with Continual Learning (VLNCL). It couples a dual-loop memory replay mechanism (Dual-SR) with a cross-modal Structured Transformer VLN agent to enable rapid adaptation to new environments while preserving prior knowledge, and it defines Seen Transfer and Unseen Transfer as evaluation metrics. The authors validate VLNCL on a multi-domain VLN setting built from the R2R dataset, achieving state-of-the-art continual learning performance and strong resistance to forgetting, while improving transfer to unseen tasks. The proposed approach offers a practical pathway toward real-world VLN systems capable of continual learning across evolving environments, with a benchmark and metrics to guide future research.

Abstract

Vision-language navigation (VLN) is a critical domain within embodied intelligence, requiring agents to navigate 3D environments based on natural language instructions. Traditional VLN research has focused on improving environmental understanding and decision accuracy. However, these approaches often exhibit a significant performance gap when agents are deployed in novel environments, mainly because of the limited diversity of training data. Expanding datasets to cover a broader range of environments is impractical and costly. To address this challenge, we propose the Vision-Language Navigation with Continual Learning (VLNCL) paradigm, in which agents incrementally learn new environments while retaining previously acquired knowledge. VLNCL enables agents to maintain an environmental memory and extract relevant knowledge, allowing rapid adaptation to new environments while preserving existing information. We introduce a novel dual-loop scenario replay method (Dual-SR), inspired by memory replay mechanisms in the brain and integrated with VLN agents. This method consolidates past experiences and enhances generalization to new tasks. By using a multi-scenario memory buffer, the agent efficiently organizes and replays task memories, which strengthens its ability to adapt quickly to new environments and mitigates catastrophic forgetting. Our work pioneers continual learning in VLN agents, introducing a novel experimental setup and evaluation metrics. We demonstrate the effectiveness of our approach through extensive evaluations and establish a benchmark for the VLNCL paradigm. Comparative experiments with existing continual learning and VLN methods show significant improvements, achieving state-of-the-art continual learning performance and highlighting the potential of our approach to enable rapid adaptation while preserving prior knowledge.
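The multi-scenario memory buffer mentioned in the abstract can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the class name `ScenarioMemoryBuffer`, the per-scenario FIFO capacity, and the uniform sampling rule are all hypothetical, and an "episode" is just an opaque object standing in for an instruction-trajectory pair.

```python
import random
from collections import defaultdict, deque


class ScenarioMemoryBuffer:
    """Hypothetical replay buffer that groups stored episodes by scenario."""

    def __init__(self, per_scenario_capacity, seed=0):
        # One bounded FIFO queue per scenario; the oldest episode is evicted
        # when a scenario exceeds its capacity (an assumed eviction policy).
        self._store = defaultdict(lambda: deque(maxlen=per_scenario_capacity))
        self._rng = random.Random(seed)

    def add(self, scenario_id, episode):
        """Store one episode under the scenario it was collected in."""
        self._store[scenario_id].append(episode)

    def scenarios(self):
        """Return the scenario ids currently held, in sorted order."""
        return sorted(self._store)

    def sample(self, batch_size):
        """Pick a scenario uniformly, then an episode uniformly within it."""
        ids = self.scenarios()
        if not ids:
            return []
        batch = []
        for _ in range(batch_size):
            sid = self._rng.choice(ids)
            batch.append((sid, self._rng.choice(list(self._store[sid]))))
        return batch

    def __len__(self):
        return sum(len(queue) for queue in self._store.values())
```

Sampling the scenario first, and the episode second, keeps scenarios with few stored episodes from being drowned out by larger ones. That is one plausible reading of "organizes and replays task memories", not a detail confirmed by the text.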

Paper Structure (16 sections, 16 equations, 3 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Vision-and-Language Navigation
Continual Learning
Method
Setting of Vision-Language Navigation
Formulation of Vision-Language Navigation with Continual Learning
Dual-loop Scenario Replay
Structured Transformer VLN Agents with Continual Learning
Experiments
Experiment Setup
Evaluation Protocol for VLNCL
Implementation Details
Comparative Experiment Results
Resisting Forgetting and Transferring Evaluation
...and 1 more section

Figures (3)

Figure 1: The pipeline of Vision-Language Navigation with Continual Learning (VLNCL). The agent is first trained on the seen dataset to obtain a base agent. When it then encounters various unseen tasks, the VLNCL paradigm requires it to keep learning from the new tasks without forgetting information about earlier scenes.
Figure 2: Overview of the Dual-loop Scenario Replay (Dual-SR) algorithm for the VLN agent. When the agent receives data from a new, unseen task domain, it randomly replays former tasks from the memory buffer during the inner-loop update. Once learning on the new task domain finishes, the agent performs the outer-loop update to balance its parameters.
Figure 3: Change in success rate as the agent continually learns new task domains. Panel (a) shows performance on the Val Unseen split, evaluated on the unseen task domain set $\mathcal{S}_{unseen}^T$, to demonstrate knowledge transfer. Panels (b) and (c) show performance on the Train Seen and Val Seen splits, evaluated on the seen task domain set $\mathcal{S}_{seen}^T$, to demonstrate resistance to forgetting.
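The dual-loop structure described in the Figure 2 caption can be sketched on a toy problem. This is a sketch under stated assumptions, not the paper's algorithm: each "task" is reduced to a target vector with a quadratic loss, the inner loop is plain gradient descent, and the outer loop is assumed to be a Reptile-style interpolation between the old and the newly adapted weights. The names `dual_loop_update`, `continual_learning`, `replay_prob`, and `outer_lr` are hypothetical.

```python
import random


def grad_quadratic(params, target):
    """Gradient of 0.5 * ||params - target||^2, a toy stand-in for a VLN loss."""
    return [p - t for p, t in zip(params, target)]


def dual_loop_update(theta, new_task, memory, inner_steps=20, inner_lr=0.1,
                     outer_lr=0.5, replay_prob=0.5, rng=None):
    """One round of learning a new task with replay (illustrative only)."""
    rng = rng or random.Random(0)
    fast = list(theta)  # inner-loop weights start from the current weights
    for _ in range(inner_steps):
        # Inner loop: sometimes replay a randomly chosen former task,
        # otherwise train on the new task.
        if memory and rng.random() < replay_prob:
            target = rng.choice(memory)
        else:
            target = new_task
        grad = grad_quadratic(fast, target)
        fast = [w - inner_lr * g for w, g in zip(fast, grad)]
    # Outer loop (assumed Reptile-style): move the slow weights only part of
    # the way toward the adapted weights to balance old and new knowledge.
    return [t + outer_lr * (f - t) for t, f in zip(theta, fast)]


def continual_learning(theta, task_stream):
    """Learn tasks one after another, storing each finished task for replay."""
    memory = []
    for task in task_stream:
        theta = dual_loop_update(theta, task, memory)
        memory.append(task)
    return theta, memory
```

In this toy version, `outer_lr` controls how far the agent commits to what it learned in the inner loop: a value of 0 keeps the old weights unchanged and a value of 1 adopts the adapted weights outright.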
