Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Xinshuai Song; Weixing Chen; Yang Liu; Weikai Chen; Guanbin Li; Liang Lin

TL;DR

The paper formulates Long-Horizon Vision-Language Navigation (LH-VLN), a task targeting multi-stage, context-rich navigation in dynamic environments. The authors introduce NavGen, an automated bidirectional data-generation platform, and LHPR-VLN, a benchmark of 3,260 tasks averaging 150 steps each, to support long-horizon evaluation. They also propose MGDM, a Multi-Granularity Dynamic Memory module that combines short-term memory blurring with long-term memory retrieval to keep decisions coherent across extended trajectories. For fine-grained assessment, they define the ISR, CSR, and CGT metrics (along with TAR), which capture subtask-level and sequence-level performance beyond traditional VLN metrics. The reported results show MGDM achieving state-of-the-art performance on LH-VLN, which the authors attribute to coherent reasoning and adaptive memory across multi-stage tasks.

Abstract

Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. To support LH-VLN, we develop an automated data generation platform, NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark, consisting of 3,260 tasks with an average of 150 task steps, which serves as the first dataset specifically designed for the long-horizon vision-language navigation task. We further propose the Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Together, our platform, benchmark, and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.
