CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Haozhou Li; Xiangyu Dong; Huiyan Jiang; Yaoming Zhou; Xiaoguang Ma

TL;DR

This work proposes CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases.

Abstract

Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN agents cannot selectively recall and use relevant prior experience during navigation, which limits their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline that mimics how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests show average success rate improvements of 52.9%, 20.9%, and 20.9% in simulation and 200%, 50%, and 50% in real-world tests over NavGPT, MapGPT, and DiscussNav, respectively, demonstrating the potential of CMMR-VLN as a backbone VLN framework.
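The selective memory update and retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the Jaccard overlap on landmark sets as the retrieval score, and the use of a reference path to locate the first mistake are all assumptions made for illustration. The paper indexes memory by panoramic images as well as landmarks and identifies the key mistake through reflection.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    landmarks: frozenset  # salient landmarks used as the retrieval key
    outcome: str          # "success" or "failure"
    content: tuple        # full path on success, first wrong step on failure


@dataclass
class ExperienceMemory:
    """Hypothetical memory store; names and scoring are illustrative only."""
    entries: list = field(default_factory=list)

    def retrieve(self, landmarks, k=1):
        """Return up to k entries ranked by Jaccard overlap of landmark sets."""
        query = frozenset(landmarks)

        def score(entry):
            union = query | entry.landmarks
            return len(query & entry.landmarks) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=score, reverse=True)
        # Drop entries that share no landmark with the query.
        return [e for e in ranked[:k] if score(e) > 0]

    def update(self, landmarks, taken, reference, success):
        """Store the whole path on success, only the first wrong step on failure."""
        key = frozenset(landmarks)
        if success:
            self.entries.append(Experience(key, "success", tuple(taken)))
            return
        # Assumption: the first mistake is the first step that differs from a
        # reference path. If no step differs, nothing is stored.
        for step, (chosen, expected) in enumerate(zip(taken, reference)):
            if chosen != expected:
                self.entries.append(
                    Experience(key, "failure", (step, chosen, expected))
                )
                return


memory = ExperienceMemory()
memory.update({"sofa", "staircase"}, ["v1", "v2", "v3"], ["v1", "v2", "v3"], success=True)
memory.update({"kitchen", "fridge"}, ["v1", "v5", "v6"], ["v1", "v4", "v7"], success=False)
hits = memory.retrieve({"fridge", "door"})
```

In this sketch the failed episode contributes a single record (step index, chosen viewpoint, expected viewpoint) rather than the whole trajectory, which keeps the memory compact and focuses later retrieval on the decision that first went wrong.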

Paper Structure (15 sections, 5 equations, 4 figures, 3 tables)

INTRODUCTION
RELATED WORK
Vision-and-Language Navigation (VLN)
Memory-Augmented and Reflective Agents
METHODOLOGY
Multimodal Experience Memory (MEM)
Retrieval-Augmented Generation Pipeline (RAGP)
Reflection and Memory Update
EXPERIMENTS AND ANALYSIS
Experiment Setup
Simulation Experiments
Ablation Study
Case Study
Experiments on A Real Robot
CONCLUSIONS

Figures (4)

Figure 1: The overall CMMR-VLN framework consists of three modules from left to right. The Multimodal Experience Memory (MEM) performs memory building before navigation. The Retrieval-Augmented Generation Pipeline (RAGP) carries out corresponding prompting and action execution at each navigation step. The Reflection Module conducts reflection and updates the experience in memory after navigation.
Figure 2: Details of the Reflection Module in Fig. 1.
Figure 3: Case study 1.
Figure 4: Case study 2.
