Advances in Embodied Navigation Using Large Language Models: A Survey

Jinzhou Lin; Han Gao; Xuxiang Feng; Rongtao Xu; Changwei Wang; Man Zhang; Li Guo; Shibiao Xu

TL;DR

This survey addresses the problem of enabling robust embodied navigation through large language models by dissecting how LLMs support grounded language understanding and few-shot planning within multimodal perception loops. It catalogs state-of-the-art LLM-based navigation architectures, contrasts them with non-LLM VLN baselines, and analyzes common datasets and evaluation metrics, including SPL. Key contributions include a comprehensive review of LLM-based approaches, a dataset-oriented analysis highlighting strengths and gaps, and a discussion of challenges and future directions such as multimodal fusion, memory, and standardized benchmarks. The findings underscore the potential of LLMs to enhance navigation via sophisticated reasoning and semantic understanding, while also emphasizing practical constraints like computation, data quality, and real-time latency. Overall, the work provides a structured guide for researchers to design, compare, and benchmark embodied navigation systems that leverage LLMs for real-world applicability.
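As a concrete reference for the SPL metric mentioned above, the following is a minimal sketch of Success weighted by Path Length as it is commonly defined for navigation benchmarks: the mean over episodes of the success flag times the ratio of shortest-path length to the larger of the path actually taken and the shortest path. The function name `spl`, its argument names, and the choice to reject non-positive shortest-path lengths are assumptions made for illustration, not details taken from the survey.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    successes[i]: whether episode i reached the goal
    shortest[i]:  shortest-path length from start to goal (must be > 0)
    taken[i]:     length of the path the agent actually travelled
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, l, p in zip(successes, shortest, taken):
        if l <= 0:
            # Assumption: degenerate episodes (start == goal) are rejected.
            raise ValueError("shortest-path length must be positive")
        if success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer the taken path was.
            total += l / max(p, l)
    return total / len(successes)


# Three episodes: optimal success, success with a 2x detour, failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
```

In this example the three episodes contribute 1.0, 0.5, and 0.0, so the score is their mean. A failed episode scores zero regardless of how short its path was, which is why SPL penalizes both failure and inefficient success.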

Abstract

In recent years, the rapid advancement of Large Language Models (LLMs) such as the Generative Pre-trained Transformer (GPT) has attracted increasing attention due to their potential in a variety of practical applications. The integration of LLMs with embodied intelligence has emerged as a significant area of focus. Among the many applications of LLMs, navigation tasks are particularly noteworthy because they demand a deep understanding of the environment and quick, accurate decision-making. LLMs can augment embodied intelligence systems with sophisticated environmental perception and decision-making support, leveraging their robust language and image-processing capabilities. This article summarizes the interplay between LLMs and embodied intelligence, with a focus on navigation. It reviews state-of-the-art models and research methodologies, and assesses the advantages and disadvantages of existing embodied navigation models and datasets. Finally, based on current research, the article clarifies the role of LLMs in embodied intelligence and forecasts future directions in the field. A comprehensive list of the studies covered in this survey is available at https://github.com/Rongtao-Xu/Awesome-LLM-EN.

Paper Structure (14 sections, 1 equation, 6 figures, 6 tables)

Introduction
Background
Large Language Models
Embodied Intelligence
LLMs in Embodied Navigation
LLMs for Grounded Language Understanding
LLMs for Few-Shot Planning
Embodied Navigation
LLM-based Model
Other Models
Comparison
Datasets
Challenges and Future Directions
Conclusion

Figures (6)

Figure 1: A timeline of embodied navigation work from 2022 to 2024, with framework diagrams shown for five representative works. The timeline illustrates the evolution of major works and offers insight into the advancement of embodied agents.
Figure 2: The first type uses LLMs to analyze incoming visual or textual data and extract goal-relevant information, from which exploration policies then generate actions to guide agent movement. Here the LLM receives text, along with image content that has been processed by separate visual models, and performs semantic understanding rather than planning. It extracts key information such as targets and locations and hands it to the exploration algorithm, which generates the actions that guide the agent through the navigation task.
Figure 3: The second type employs LLMs as planners that directly generate actions, thereby leveraging exploration policies to control agents. Here the LLM receives text, along with image content that has been processed by separate visual models, performs planning (a dialogue format is used as the example), and hands the resulting actions to the exploration algorithm and the agent, which carry out the navigation task.
Figure 4: This figure is an example diagram for Planning.
Figure 5: This figure is an example diagram for semantic understanding.
...and 1 more figure
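The two roles described in the captions for Figures 2 and 3 can be contrasted in a short sketch. Everything here is hypothetical and for illustration only: the `LLM` type is a stand-in for any text-in, text-out model call, and `extract_goal`, `explore_policy`, `plan_action`, the prompts, and the action names are assumed, not drawn from any system covered in the survey.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: prompt in, text out.
LLM = Callable[[str], str]

# Assumed discrete action space for the toy agent.
ACTIONS = ("move_forward", "turn_left", "turn_right", "stop")


def extract_goal(llm: LLM, instruction: str) -> str:
    """Type 1 (semantic understanding): the LLM only names the target."""
    reply = llm(f"Name the target object in this instruction: {instruction}")
    return reply.strip().lower()


def explore_policy(goal: str, visible_objects: List[str]) -> str:
    """Toy exploration policy: stop once the goal is in view."""
    return "stop" if goal in visible_objects else "move_forward"


def plan_action(llm: LLM, instruction: str, observation: str) -> str:
    """Type 2 (planning): the LLM names the next action directly."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Observation: {observation}\n"
        f"Choose one action from {ACTIONS}."
    )
    reply = llm(prompt).strip().lower()
    # Fall back to stopping when the reply is not a known action.
    return reply if reply in ACTIONS else "stop"


# Stub models so the sketch runs without any external service.
goal = extract_goal(lambda prompt: " Chair ", "Walk to the chair.")
step_type1 = explore_policy(goal, ["table", "lamp"])
step_type2 = plan_action(lambda prompt: "turn_left", "Walk to the chair.",
                         "A table is ahead; a doorway is on the left.")
```

In the first role the LLM's output is data for a separate policy, so a poor extraction degrades the search but cannot produce an invalid action. In the second role the LLM's reply is itself the action, which is why the sketch validates it against the action set before use.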
