Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin

TL;DR

Long-Horizon Vision-Language Navigation (LH-VLN) is formulated to address multi-stage, context-rich navigation in dynamic environments. The authors introduce NavGen, an automated bidirectional data-generation platform, and LHPR-VLN, a large-scale benchmark with 3,260 tasks (avg. 150 steps) to support robust long-horizon evaluation; they also propose MGDM, a Multi-Granularity Dynamic Memory module that fuses short-term memory blurring with long-term memory retrieval to preserve coherence across extended trajectories. To enable fine-grained assessment, ISR, CSR, CGT (and TAR) metrics are defined, capturing sub-task and sequence-level performance beyond traditional VLN metrics. Empirical results show MGDM achieves state-of-the-art performance on LH-VLN by maintaining coherent reasoning and adaptive memory across multi-stage tasks, demonstrating the practical value of the platform, benchmark, and memory-centric approach for real-world long-horizon navigation.

Abstract

Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. In addition, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weighted by Ground Truth (CGT) metrics to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, a comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.

Paper Structure

This paper contains 30 sections, 16 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: The NavGen data generation platform. Forward generation produces complex LH-VLN tasks and their corresponding subtasks by prompting GPT-4 with sampled assets. The sampled assets are deployed in the simulator, and trajectory data is generated from the decisions of a navigation model or an expert. In backward generation, a trajectory splitting algorithm divides the trajectory of each subtask into action-label pairs according to the trajectory type. These pairs are then input into GPT-4 to generate step-by-step tasks.
  • Figure 2: Overview of the LHPR-VLN benchmark statistics. In our statistics, Spot and Stretch robot-type tasks account for 50.5% and 49.5%, respectively. LH-VLN tasks containing 2, 3, and 4 subtasks account for 39.0%, 52.4%, and 8.6%, respectively.
  • Figure 3: The framework of the Multi-Granularity Dynamic Memory (MGDM) model. The CoT feedback module receives task instructions and, based on the historical observations in the corresponding memory, generates a chain of thought and constructs language prompts. The short-term memory module aims to minimize the entropy of the confidence vector, using pooling operations to forget and blur the memory sequence. The long-term memory module selects and matches data from the dataset to weight the decisions of the LLM, ultimately determining the action to be executed by the agent.
  • Figure 4: Visualization of a partially successful long-horizon navigation by our MGDM. We highlight aligned landmarks with colored bounding boxes in the images and with words of the same color in the instruction. In the first navigation segment, the agent looks for a towel in the bathroom. It successfully finds both the bathroom and the towel but does not enter the bathroom or get close enough to the towel for the task to be marked as successful. In the next phase, the agent successfully finds the box in the living room.
  • Figure 5: Statistics of the LH-VLN dataset distribution by task length and robot configuration. We refer to tasks with 2, 3, and 4 subtasks as Short, Medium, and Long tasks, respectively.
  • ...and 5 more figures
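
The Figure 3 caption describes a short-term memory that uses pooling to forget and blur the memory sequence while minimizing the entropy of a confidence vector. The Python sketch below is one possible reading of that sentence, not the authors' implementation. The choice of average pooling, the greedy search over adjacent pairs, and the names `entropy`, `blur_once`, and `blur_memory` are all assumptions made for illustration.

```python
import math


def entropy(conf):
    """Shannon entropy of a confidence vector after normalizing it to sum to 1."""
    total = sum(conf)
    probs = [c / total for c in conf]
    return -sum(p * math.log(p) for p in probs if p > 0)


def blur_once(memory, conf):
    """Average-pool one adjacent pair of memory entries (assumed pooling choice).

    Every adjacent pair is tried; the merge whose resulting confidence
    vector has the lowest entropy is kept.
    """
    best = None
    for i in range(len(memory) - 1):
        pooled_feat = [(a + b) / 2 for a, b in zip(memory[i], memory[i + 1])]
        pooled_conf = (conf[i] + conf[i + 1]) / 2
        cand_mem = memory[:i] + [pooled_feat] + memory[i + 2:]
        cand_conf = conf[:i] + [pooled_conf] + conf[i + 2:]
        h = entropy(cand_conf)
        if best is None or h < best[0]:
            best = (h, cand_mem, cand_conf)
    return best[1], best[2]


def blur_memory(memory, conf, max_len):
    """Greedily blur the sequence until it holds at most max_len entries."""
    memory, conf = list(memory), list(conf)
    # Never shrink below one entry, since a merge needs at least one pair.
    while len(memory) > max(max_len, 1):
        memory, conf = blur_once(memory, conf)
    return memory, conf


# Hypothetical toy data: four 2-D observation features with confidences.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
confidences = [0.9, 0.1, 0.1, 0.8]
short_memory, short_conf = blur_memory(features, confidences, max_len=3)
```

In this reading, each merge both shortens the sequence (forgetting) and averages neighboring features (blurring), so memory stays bounded over long trajectories. A real model would operate on learned feature tensors and might use a different pooling operator or selection rule.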