SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning
Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, Niko Suenderhauf
TL;DR
SayPlan tackles the challenge of grounding long-horizon robot task plans produced by large language models in large-scale, multi-floor environments. It achieves scalability by grounding LLMs in 3D scene graphs and using a semantic search over collapsed graphs to identify a task-relevant subgraph, while delegating navigation to a classical path planner and applying iterative replanning with a scene-graph simulator to ensure feasibility. The approach yields substantial gains in token efficiency (up to $82.1\%$ token reduction) and near-perfect executability in real-world robotic experiments, outperforming baselines that lack iterative grounding. This work advances scalable, generalizable grounding for service robots operating in homes, offices, and hospitals, enabling robust execution of natural-language instructions across complex environments.
Abstract
Large language models (LLMs) have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant challenge for robotics. We introduce SayPlan, a scalable approach to LLM-based, large-scale task planning for robotics using 3D scene graph (3DSG) representations. To ensure the scalability of our approach, we: (1) exploit the hierarchical nature of 3DSGs to allow LLMs to conduct a 'semantic search' for task-relevant subgraphs from a smaller, collapsed representation of the full graph; (2) reduce the planning horizon for the LLM by integrating a classical path planner; and (3) introduce an 'iterative replanning' pipeline that refines the initial plan using feedback from a scene graph simulator, correcting infeasible actions and avoiding planning failures. We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects, and show that it can ground large-scale, long-horizon task plans from abstract, natural language instructions for a mobile manipulator robot to execute. We provide real robot video demonstrations on our project page: https://sayplan.github.io.
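To make steps (1) and (3) concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The two-level graph (rooms containing objects), the names `SceneGraph`, `semantic_search`, `simulate`, and `iterative_replan`, and the callables `propose_room` and `revise` (which stand in for LLM calls) are all hypothetical. Navigation is reduced to an abstract `goto` action, on the assumption that a classical path planner would handle the route as in step (2).

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    rooms: dict                      # room name -> list of objects (toy data)
    expanded: set = field(default_factory=set)

    def view(self):
        """Collapsed view: only expanded rooms reveal their objects."""
        return {r: (list(objs) if r in self.expanded else None)
                for r, objs in self.rooms.items()}

    def expand(self, room):
        self.expanded.add(room)

    def contract(self, room):
        self.expanded.discard(room)


def semantic_search(graph, target, propose_room, max_steps=10):
    """Expand and contract rooms until the target object becomes visible."""
    tried = set()
    for _ in range(max_steps):
        room = propose_room(graph.view(), tried)   # stand-in for the LLM
        if room is None:
            break
        graph.expand(room)
        if target in graph.rooms[room]:
            return graph.view()                    # task-relevant subgraph
        graph.contract(room)                       # collapse irrelevant room
        tried.add(room)
    return None


def simulate(rooms, plan):
    """Return None if the plan is executable, else a feedback string."""
    location = None
    for action, arg in plan:
        if action == "goto":
            if arg not in rooms:
                return f"unknown room: {arg}"
            location = arg
        elif action == "pickup":
            if location is None or arg not in rooms[location]:
                return f"cannot pickup {arg}: not at robot location"
    return None


def iterative_replan(rooms, plan, revise, max_iters=5):
    """Revise the plan with simulator feedback until it verifies."""
    for _ in range(max_iters):
        feedback = simulate(rooms, plan)
        if feedback is None:
            return plan
        plan = revise(plan, feedback)              # stand-in for the LLM
    return None


def first_untried(view, tried):
    """Trivial proposal policy: pick the first room not yet explored."""
    return next((r for r in view if r not in tried), None)


rooms = {"kitchen": ["mug"], "office": ["stapler"], "lab": ["drill"]}
subgraph = semantic_search(SceneGraph(rooms), "stapler", first_untried)
plan = iterative_replan(rooms, [("pickup", "stapler")],
                        lambda p, fb: [("goto", "office")] + p)
```

In this toy run the search expands and re-collapses the kitchen before finding the stapler in the office, so only one room's contents remain visible in the returned view. The initial plan fails verification because the robot has not moved, and a single revision that prepends a `goto` action makes it executable.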
