EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning

Meghan Booker, Grayson Byrd, Bethany Kemp, Aurora Schmidt, Corban Rivera

TL;DR

This work proposes EmbodiedRAG, a 3D scene subgraph retrieval framework inspired by Retrieval-Augmented Generation methods that retrieve query-relevant document chunks for LLM question answering, and demonstrates that it significantly reduces input token counts and planning time.

Abstract

Recent advances in Large Language Models (LLMs) have helped facilitate exciting progress for robotic planning in real, open-world environments. 3D scene graphs (3DSGs) offer a promising environment representation for grounding such LLM-based planners as they are compact and semantically rich. However, as the robot's environment scales (e.g., number of entities tracked) and the complexity of scene graph information increases (e.g., maintaining more attributes), providing the 3DSG as-is to an LLM-based planner quickly becomes infeasible due to input token count limits and attentional biases present in LLMs. Inspired by the successes of Retrieval-Augmented Generation (RAG) methods that retrieve query-relevant document chunks for LLM question answering, we adapt the paradigm for our embodied domain. Specifically, we propose a 3D scene subgraph retrieval framework, called EmbodiedRAG, with which we augment an LLM-based planner for executing natural language robotic tasks. Notably, our retrieved subgraphs adapt to changes in the environment as well as changes in task-relevancy as the robot executes its plan. We demonstrate EmbodiedRAG's ability to significantly reduce input token counts (by an order of magnitude) and planning time (up to 70% reduction in average time per planning step) while improving success rates on AI2Thor simulated household tasks with a single-arm, mobile manipulator. Additionally, we implement EmbodiedRAG on a quadruped with a manipulator to highlight the performance benefits for robot deployment at the edge in real environments.

Paper Structure

This paper contains 28 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Left: An illustration of the EmbodiedRAG framework. When a natural language task comes in, an LLM provides a quick abstraction for pre-retrieving task-relevant entities and corresponding attributes without knowledge of the environment. As the robot begins executing the task, it builds a 3DSG representing the environment; nodes in the 3DSG are indexed and updated in a vector store, and task-relevant entities are retrieved and grounded to a subgraph. A ReAct-style LLM agent then generates plans for the robot to execute using the subgraphs. These plans result in actions that affect the environment and in thoughts on downstream actions, which an LLM-based self-query mechanism uses to identify entities and attributes that are task-relevant to the particular operating environment and plan execution. Right: Examples of scene graphs generated for the robot's operating environment in simulation (top) and hardware (bottom).
  • Figure 2: Average performance of agents evaluated across three experiments with a varying number of distractor objects present in the 3DSG (either 290 or 1,135 distractor objects). Each experiment consists of 40 different tasks (20 easy, 20 hard) in five AI2Thor kitchen scenes. (top) Success rates. Darker and lighter colors reflect the proportion of successful easy and hard tasks, respectively. Error bars (black lines) are the standard deviation of the total success rates across the three experiments. (middle) Average time in seconds per planning step. (bottom) Average cumulative tokens used to represent the 3DSG to the LLM-based planner. Note that for GPT-4o-mini with 1,135 distractor objects, an average of 15 tasks per experiment did not complete due to input token count limits. The approximated token count is shown in light yellow using the average cumulative tokens used per task.
  • Figure 3: Modes of failure.
  • Figure 4: Example task completion in AI2Thor by the EmbodiedRAG-feedback agent. Task: Pick up the credit card that is on the counter top and place it in the drawer. Prior to task execution, the agent generates the abstraction {credit card, counter top, drawer}. (a) The agent starts in the kitchen; (b) after some exploration, the credit card is visible to the agent, added to the 3DSG, and relayed to the agent via retrieval; (c) the agent takes the credit card to the kitchen drawer, but it is closed; (d) the agent adapts by placing the card by the kitchen sink, before (e) moving back to the drawer and opening it; (f) the agent moves back to the sink to pick up the card again, where the self-query mechanism includes the additional attribute information isPickedUp for the credit card, before (g) moving back to the open drawer and (h) placing the credit card in the drawer to complete the task.
  • Figure 5: Example of the self-query retrieval adding task-relevant state information (orange) for the object Spatula_1 that was already in the retrieved subgraph (blue).
  • ...and 1 more figure
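
The retrieve-and-ground step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The toy scene graph, the node names other than those appearing in the captions, the bag-of-words `embed` function (standing in for a learned embedding model and vector store), and the helper names `retrieve` and `ground` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy 3DSG: node id -> (text description, parent id).
# Hypothetical entries; a real 3DSG would carry many more attributes.
SCENE_GRAPH = {
    "Kitchen_1": ("kitchen room", None),
    "CounterTop_1": ("counter top surface", "Kitchen_1"),
    "Drawer_1": ("drawer closed", "Kitchen_1"),
    "CreditCard_1": ("credit card", "CounterTop_1"),
    "Spatula_1": ("spatula utensil", "Drawer_1"),
    "Toaster_1": ("toaster appliance", "CounterTop_1"),
}

def embed(text):
    """Stand-in embedding: bag-of-words counts (assumption, not a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(index, query, k=1):
    """Return up to k node ids most similar to the query, skipping zero scores."""
    q = embed(query)
    scored = sorted(index.items(), key=lambda item: cosine(q, item[1]), reverse=True)
    return [node_id for node_id, vec in scored[:k] if cosine(q, vec) > 0]

def ground(graph, node_ids):
    """Add ancestors of each retrieved node so the result is a connected subgraph."""
    keep = set()
    for node_id in node_ids:
        while node_id is not None and node_id not in keep:
            keep.add(node_id)
            node_id = graph[node_id][1]
    return {n: graph[n] for n in graph if n in keep}

# Index every node description, then retrieve one node per abstraction term.
index = {node_id: embed(desc) for node_id, (desc, _) in SCENE_GRAPH.items()}
abstraction = ["credit card", "counter top", "drawer"]  # from the Figure 4 example
hits = [n for term in abstraction for n in retrieve(index, term)]
subgraph = ground(SCENE_GRAPH, hits)
print(sorted(subgraph))
```

In this sketch, distractor nodes such as the hypothetical `Toaster_1` never reach the planner prompt, which is the intuition behind the reduced token counts. A self-query step, as described in the captions, could be approximated by appending new terms to `abstraction` during execution and re-running retrieval.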