Towards Learning a Generalist Model for Embodied Navigation

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, Liwei Wang

TL;DR

This work introduces NaviLLM, the first generalist embodied navigation model that unifies diverse vision-language and embodied tasks through schema-based instruction, enabling a single model to learn from multiple datasets. By coupling a scene encoder with a generation-capable LLM and converting tasks into generation prompts, NaviLLM achieves state-of-the-art results on CVDN, SOON, and ScanQA and demonstrates strong zero-shot generalization to unseen tasks like embodied question answering and 3D captioning. The approach leverages multi-task learning and a two-stage training pipeline, showing that dataset diversity and cross-task transfer substantially improve generalization. The results underscore the potential of generalist, LLM-backed agents for versatile, real-world embodied navigation and interaction.

Abstract

Building a generalist agent that can interact with the world is an intriguing goal of AI systems, and it has spurred research on embodied navigation, where an agent must navigate according to instructions or respond to queries. Despite major progress, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have shown remarkable capabilities across various fields, providing a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction, which flexibly casts various tasks into generation problems and thereby unifies a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into training, equipping NaviLLM with the wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.
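
To make the idea of casting different tasks into generation problems more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a prompt can be assembled from a handful of named fields. The class `SchemaPrompt`, its field names, the placeholder tokens such as `<view_0>`, and the example instructions are all hypothetical and are not taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaPrompt:
    # All field names are hypothetical, chosen for illustration only.
    task: str                                              # instruction or question
    observation: List[str] = field(default_factory=list)   # placeholders for scene features
    history: List[str] = field(default_factory=list)       # placeholders for past steps
    output_hint: str = ""                                  # describes the expected output

    def render(self) -> str:
        """Serialize the fields into a single text prompt for a generative LLM."""
        parts = [f"Task: {self.task}"]
        if self.history:
            parts.append("History: " + " ".join(self.history))
        if self.observation:
            parts.append("Observation: " + " ".join(self.observation))
        parts.append(f"Output hint: {self.output_hint}")
        return "\n".join(parts)


def candidate_tokens(n: int) -> List[str]:
    """Return placeholder tokens standing in for encoded scene views (assumed)."""
    return [f"<view_{i}>" for i in range(n)]


# A navigation step and a question-answering query share one prompt layout.
nav = SchemaPrompt(
    task="Go to the kitchen and stop by the sink.",
    observation=candidate_tokens(3),
    history=["<step_0>"],
    output_hint="Select the next view to move to.",
)
qa = SchemaPrompt(
    task="What color is the sofa?",
    observation=candidate_tokens(2),
    output_hint="Answer the question in a few words.",
)

nav_prompt = nav.render()
qa_prompt = qa.render()
print(nav_prompt)
print(qa_prompt)
```

In this sketch the two tasks differ only in the contents of the fields, which is the property that would let a single generative model be trained on mixed datasets.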

Paper Structure (28 sections, 4 equations, 3 figures, 13 tables)

This paper contains 28 sections, 4 equations, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Comparison between previous methods and ours. Previous methods learn task-specific navigation agents, suffer from a low success rate for out-of-domain VLN, and fall short when facing unseen tasks (e.g., QA and summarization). Colors distinguish the examples; for instance, orange marks an example from in-domain VLN. Our NaviLLM not only excels in the diverse tasks required by embodied navigation, but also demonstrates promising generalizability even on unseen tasks.
  • Figure 2: The overview of NaviLLM. The left figure presents the architecture and workflow of our model, while the right figure illustrates the schema-based instruction and multi-task learning process in our method.
  • Figure 3: Visualization of our method on unseen scenes and unseen tasks. In panel (a), lines and text of the same color represent sub-trajectories and their corresponding sub-instructions. In panels (b) and (c), the gray text describes the agent's actions during navigation, while the red arrow indicates the direction in which the agent moves.