Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Yue Zhang; Ziqiao Ma; Jialu Li; Yanyuan Qiao; Zun Wang; Joyce Chai; Qi Wu; Mohit Bansal; Parisa Kordjamshidi

TL;DR

This survey analyzes Vision-and-Language Navigation (VLN) in the era of foundation models, adopting a top-down LAW framework that splits challenges into world modeling, human-robot communication, and embodied navigation. It surveys how large language and vision-language models enable world representations, grounded instruction understanding, and planning, while also examining benchmarks, data limitations, and the transition from simulation to real-robot deployment. Key contributions include a systematic taxonomy of VLN research under foundation-model guidance, coverage of memory, generalization, grounding, and planning techniques, and a forward-looking discussion of 3D world representations, interactive dialogue, and real-world deployment. The work highlights opportunities to integrate 3D-centric perception, open-ended dialogue, and memory-augmented reasoning to push VLN toward robust, real-world embodied agents with broad applicability.

Abstract

Vision-and-Language Navigation (VLN) has gained increasing attention in recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped both the challenges and the proposed methods in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning and emphasizes current methods and future opportunities that leverage foundation models to address VLN challenges. We hope our in-depth discussion provides valuable resources and insights: on the one hand, to mark the progress made and explore opportunities and potential roles for foundation models in this field, and on the other, to organize the different challenges and solutions in VLN for foundation model researchers.

Paper Structure (41 sections, 2 figures, 1 table)

Introduction
Background and Task Formulations
Cognitive Underpinnings of VLN
Relevant Tasks and Scope of the Survey
VLN Task Formulations and Benchmarks
VLN Task Definition.
Benchmarks.
Evaluation Metrics.
Foundation Models
World Model: Learning and Representing the Visual Environments
History and Memory
History Encoding.
Graph-based History.
Generalization across Environments
Pre-trained Visual Representations.
...and 26 more sections

Figures (2)

Figure 1: Organizing challenges and solutions in VLN using the LAW framework [hu2023language].
Figure 2: VLN challenges and solutions within the framework of world model, human model, and VLN agent. We discuss history and memory in the world model, ambiguous instructions in the human model, and generalization ability in both. For the VLN agent, we discuss methods for grounding and reasoning, planning, and adapting foundation models as agents. Depending on the role the foundation models serve, we categorize these methods into four types. Additionally, we discuss the potential future of foundation models for the VLN task.
