
SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments

Abhinav Rajvanshi, Karan Sikka, Xiao Lin, Bhoram Lee, Han-Pang Chiu, Alvaro Velasquez

TL;DR

SayNav tackles the challenge of navigating unknown, large-scale environments to locate multiple objects by grounding Large Language Models (LLMs) with an incremental 3D scene graph. It introduces a three-module pipeline: Incremental Scene Graph Generation, a High-Level LLM-Based Dynamic Planner, and a Low-Level PointNav Planner, enabling dynamic, context-aware planning and execution. On a ProcTHOR-based MultiON benchmark, SayNav achieves state-of-the-art success rates and outperforms an oracle baseline under realistic conditions, demonstrating robust dynamic planning even with perception noise. The work highlights the practical impact of combining semantic reasoning with grounded representations for efficient exploration in unseen environments and provides a dataset and implementation to foster further research.

Abstract

Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. Succeeding in these tasks requires a large amount of the common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment as input to LLMs, which then generate feasible and contextually appropriate high-level plans for navigation. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on the multi-object navigation (MultiON) task, which requires the agent to utilize a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. We also introduce a benchmark dataset for the MultiON task using the ProcTHOR framework, which provides large photo-realistic indoor environments with a variety of objects. SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. The code, benchmark dataset, and demonstration videos are accessible at https://www.sri.com/ics/computer-vision/saynav.
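
The plan, execute, perceive, and replan cycle described in the abstract can be illustrated with a small sketch. This is not the SayNav implementation: the `SceneGraph` class, `stub_planner`, `search_room`, and the toy room contents are all hypothetical names invented for illustration. The LLM call is replaced by a fixed heuristic ordering, and the low-level point-goal planner is replaced by a callback that only records goals. The sketch assumes a single room and 2D positions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Point = Tuple[float, float]


@dataclass
class SceneGraph:
    """Toy incremental scene graph: each room maps object names to positions."""
    rooms: Dict[str, Dict[str, Point]] = field(default_factory=dict)

    def add_observation(self, room: str, objects: Dict[str, Point]) -> None:
        # Merge newly perceived objects into the room's existing nodes.
        self.rooms.setdefault(room, {}).update(objects)

    def to_text(self, room: str) -> str:
        # Serialize one room's subgraph as text that could be placed in a prompt.
        objs = sorted(self.rooms.get(room, {}).items())
        items = ", ".join(f"{name}@{pos}" for name, pos in objs)
        return f"room={room}; objects=[{items}]"


def stub_planner(prompt: str, candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM: orders candidates by a fixed prior."""
    priority = {"desk": 0, "sofa": 1, "bed": 2}
    return sorted(candidates, key=lambda name: (priority.get(name, 99), name))


def search_room(
    graph: SceneGraph,
    room: str,
    targets: Set[str],
    perceive: Callable[[str], Dict[str, Point]],
    navigate: Callable[[Point], None],
    planner: Callable[[str, List[str]], List[str]] = stub_planner,
) -> Set[str]:
    """Execute one planned step at a time and replan on the updated graph."""
    found: Set[str] = set()
    visited: Set[str] = set()
    while found != targets:
        candidates = [
            name for name in graph.rooms.get(room, {})
            if name not in visited and name not in found
        ]
        if not candidates:
            break  # nothing left to inspect in this room
        prompt = f"Find {sorted(targets - found)}. {graph.to_text(room)}"
        step = planner(prompt, candidates)[0]  # run only the first step
        navigate(graph.rooms[room][step])      # short-distance point goal
        visited.add(step)
        new_objects = perceive(step)           # simulated perception
        graph.add_observation(room, new_objects)
        found |= targets & set(new_objects)
    return found


# Toy usage: objects revealed only when the agent reaches a landmark.
hidden = {"desk": {"laptop": (3.1, 0.6)}, "sofa": {"pillow": (1.1, 2.0)}}
graph = SceneGraph()
graph.add_observation("living_room", {"sofa": (1.0, 2.0), "desk": (3.0, 0.5)})
trace: List[Point] = []
found = search_room(
    graph, "living_room", {"laptop", "pillow"},
    perceive=lambda landmark: hidden.get(landmark, {}),
    navigate=trace.append,
)
```

Executing only the first step of each plan before replanning is one simple way to mimic the continuous refinement the abstract describes. A real system would query an LLM with the serialized graph and hand each goal to a trained point-goal navigation policy.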

Paper Structure (20 sections, 1 equation, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 1: A SayNav example: The robot uses an LLM-based planner to efficiently find one target object (laptop) in a new house.
  • Figure 2: The overview of our SayNav framework.
  • Figure 3: An example of our scene graph.
  • Figure 4: Prompt used to create the search plan for a particular room.
  • Figure 5: Prompts used to compute the feasibility of finding an object in a room-type and to identify the room-type.
  • ...and 5 more figures