SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning

Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, Niko Suenderhauf

TL;DR

SayPlan tackles the challenge of grounding long-horizon robot task plans produced by large language models (LLMs) in large-scale, multi-floor environments. It scales by grounding the LLM in a 3D scene graph: the LLM runs a semantic search over a collapsed version of the graph to identify a task-relevant subgraph, navigation is delegated to a classical path planner, and the plan is iteratively replanned against a scene graph simulator until it is feasible. The approach yields substantial token efficiency (up to 82.1% token reduction) and near-perfect executability in real-world robotic experiments, outperforming baselines that lack iterative grounding. This work advances scalable, generalizable grounding for service robots operating in homes, offices, and hospitals, enabling robust execution of natural-language instructions across complex environments.

Abstract

Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant challenge for robotics. We introduce SayPlan, a scalable approach to LLM-based, large-scale task planning for robotics using 3D scene graph (3DSG) representations. To ensure the scalability of our approach, we: (1) exploit the hierarchical nature of 3DSGs to allow LLMs to conduct a 'semantic search' for task-relevant subgraphs from a smaller, collapsed representation of the full graph; (2) reduce the planning horizon for the LLM by integrating a classical path planner; and (3) introduce an 'iterative replanning' pipeline that refines the initial plan using feedback from a scene graph simulator, correcting infeasible actions and avoiding planning failures. We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects, and show that it can ground large-scale, long-horizon task plans from abstract, natural-language instructions for a mobile manipulator robot to execute. We provide real robot video demonstrations on our project page https://sayplan.github.io.
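
The 'semantic search' over a collapsed graph described in step (1) can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the authors' implementation: the `Node` class, the `expand` and `contract` helpers, the `visible` view, and the example building are all hypothetical names introduced here, and the LLM that would decide which node to expand is left out. The sketch shows only how expanding one room and contracting it again keeps the part of the graph that is shown to the planner small.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (floor, room, asset or object)."""
    name: str
    children: list = field(default_factory=list)
    expanded: bool = False  # children are hidden until the node is expanded


def find(node, name):
    """Depth-first lookup of a node by name; returns None if it is absent."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def expand(root, name):
    """Reveal the children of the named node."""
    find(root, name).expanded = True


def contract(root, name):
    """Hide the children of the named node again."""
    find(root, name).expanded = False


def visible(node):
    """Build the view a planner would be shown: children appear only under expanded nodes."""
    view = {"name": node.name}
    if node.expanded and node.children:
        view["children"] = [visible(child) for child in node.children]
    return view


def count_nodes(view):
    """Count nodes in a view, used here as a crude proxy for prompt size."""
    return 1 + sum(count_nodes(child) for child in view.get("children", []))


# Hypothetical example environment: one floor, two rooms, two items per room.
kitchen = Node("kitchen", [Node("fridge"), Node("mug")])
office = Node("office", [Node("desk"), Node("stapler")])
floor = Node("floor_1", [kitchen, office], expanded=True)
building = Node("building", [floor], expanded=True)

collapsed_size = count_nodes(visible(building))    # rooms visible, contents hidden
expand(building, "office")                         # look inside the office
office_view_size = count_nodes(visible(building))
contract(building, "office")                       # not relevant, hide it again
expand(building, "kitchen")                        # look inside the kitchen instead
kitchen_view_size = count_nodes(visible(building))
```

In this toy setup the view never contains both rooms' contents at once, which is the intuition behind keeping the input to the LLM small as the environment grows.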

Paper Structure

This paper contains 32 sections, 1 equation, 9 figures, 22 tables.

Figures (9)

  • Figure 1: SayPlan Overview (top). SayPlan operates across two stages to ensure scalability: (left) Given a collapsed 3D scene graph and a task instruction, semantic search is conducted by the LLM to identify a suitable subgraph that contains the required items to solve the task; (right) The explored subgraph is then used by the LLM to generate a high-level task plan, where a classical path planner completes the navigational component of the plan; finally, the plan goes through an iterative replanning process with feedback from a scene graph simulator until an executable plan is identified. Numbers on the top-left corners represent the flow of operations.
  • Figure 2: Hierarchical Structure of a 3D Scene Graph. This graph consists of 4 levels. Note that the room nodes are connected to one another via sequences of pose nodes which capture the topological arrangement of a scene.
  • Figure 3: Scene Graph Token Progression During Semantic Search. This graph illustrates the scalability of our approach to large-scale 3D scene graphs. Note the importance of node contraction in maintaining a near constant token representation of the 3DSG input.
  • Figure 4: Large-scale environments used to evaluate SayPlan. The environments span multiple rooms and floors including a vast range of
  • Figure 5: 3D Scene Graph - Fully Expanded Office Environment. Full 3D scene graph exposing all the rooms, assets and objects available in the scene. Note that the LLM agent never sees all this information unless it chooses to expand every possible node without contraction.
  • ...and 4 more figures
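
The iterative replanning stage described in the Figure 1 caption (a plan is checked by a scene graph simulator and revised until it is executable) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's pipeline: `simulate`, `replan`, `stub_planner`, the `CONTAINED_IN` table, and the two-action vocabulary are hypothetical, and `stub_planner` is a hard-coded stand-in for the LLM.

```python
# Hypothetical world knowledge: which object sits inside which container.
CONTAINED_IN = {"milk": "fridge"}


def simulate(plan, state):
    """Toy simulator: return None if every action is feasible, else an error message
    describing the first infeasible action."""
    state = dict(state)  # work on a copy so the caller's state is untouched
    for step, (action, target) in enumerate(plan):
        if action == "open":
            state[target] = "open"
        elif action == "pick":
            container = CONTAINED_IN.get(target)
            if container is not None and state.get(container) != "open":
                return f"step {step}: cannot pick {target}, {container} is closed"
        else:
            return f"step {step}: unknown action {action}"
    return None


def stub_planner(feedback):
    """Stand-in for the LLM: propose a naive plan, then repair it using the feedback."""
    plan = [("pick", "milk")]
    if feedback is not None and "fridge is closed" in feedback:
        plan.insert(0, ("open", "fridge"))
    return plan


def replan(planner, state, max_iters=5):
    """Ask the planner for a plan, verify it in the simulator, and feed errors back
    until the plan is executable or the iteration budget runs out."""
    feedback = None
    for iteration in range(1, max_iters + 1):
        plan = planner(feedback)
        feedback = simulate(plan, state)
        if feedback is None:
            return plan, iteration
    raise RuntimeError("no executable plan found within the iteration budget")


initial_state = {"fridge": "closed"}
final_plan, iterations = replan(stub_planner, initial_state)
```

Here the first proposal fails in the simulator because the fridge is closed, the error text is passed back to the planner, and the second proposal opens the fridge first. Navigation between rooms is omitted from this sketch, since the paper delegates it to a classical path planner.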