GRID: Scene-Graph-based Instruction-driven Robotic Task Planning

Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng

TL;DR

This paper proposes a novel approach called Graph-based Robotic Instruction Decomposer (GRID), which leverages scene graphs instead of images to perceive global scene information and iteratively plan subtasks for a given instruction.

Abstract

Recent works have shown that Large Language Models (LLMs) can facilitate the grounding of instructions for robotic task planning. Despite this progress, most existing works have primarily focused on utilizing raw images to aid LLMs in understanding environmental information. However, this approach not only limits the scope of observation but also typically necessitates extensive multimodal data collection and large-scale models. In this paper, we propose a novel approach called Graph-based Robotic Instruction Decomposer (GRID), which leverages scene graphs instead of images to perceive global scene information and iteratively plan subtasks for a given instruction. Our method encodes object attributes and relationships in graphs through an LLM and Graph Attention Networks, integrating instruction features to predict subtasks consisting of pre-defined robot actions and target objects in the scene graph. This strategy enables robots to acquire semantic knowledge widely observed in the environment from the scene graph. To train and evaluate GRID, we establish a dataset construction pipeline to generate synthetic datasets for graph-based robotic task planning. Experiments have shown that our method outperforms GPT-4 by over 25.4% in subtask accuracy and 43.6% in task accuracy. Moreover, our method achieves a real-time speed of 0.11s per inference. Experiments conducted on datasets of unseen scenes and scenes with varying numbers of objects demonstrate that the task accuracy of GRID declined by at most 3.8%, showcasing its robust cross-scene generalization ability. We validate our method in both physical simulation and the real world. More details can be found on the project page https://jackyzengl.github.io/GRID.github.io/.
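The iterative planning described in the abstract can be sketched as a simple loop: predict one subtask (a pre-defined action plus a target node in the scene graph), execute it, update the graphs, and repeat. The following is a minimal illustration under stated assumptions, not the authors' implementation. The names `SceneGraph`, `RobotGraph`, `toy_planner`, `execute`, and `run_task`, the `"finish"` stop action, and the rule-based planner that stands in for the learned network are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Subtask = Tuple[str, int]  # (pre-defined action name, target node id)


@dataclass
class SceneGraph:
    # Hypothetical minimal scene graph: node id -> object name,
    # plus (source id, relation, target id) edges.
    nodes: Dict[int, str] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)


@dataclass
class RobotGraph:
    # Hypothetical robot state: id of the held object, if any.
    holding: Optional[int] = None


def toy_planner(instruction: str, scene: SceneGraph, robot: RobotGraph) -> Subtask:
    """Rule-based stand-in for the learned model (assumption).

    It ignores the instruction text and only handles 'put the cup in the sink'.
    """
    ids = {name: node_id for node_id, name in scene.nodes.items()}
    cup, sink = ids["cup"], ids["sink"]
    if (cup, "in", sink) in scene.edges:
        return ("finish", -1)  # assumed stop signal
    if robot.holding == cup:
        return ("place", sink)
    return ("pick", cup)


def execute(subtask: Subtask, scene: SceneGraph, robot: RobotGraph) -> None:
    """Apply a subtask and update both graphs to reflect the new state."""
    action, target = subtask
    if action == "pick":
        # The picked object no longer rests on or in anything.
        scene.edges = [e for e in scene.edges if e[0] != target]
        robot.holding = target
    elif action == "place" and robot.holding is not None:
        scene.edges.append((robot.holding, "in", target))
        robot.holding = None


def run_task(instruction, scene, robot, planner, max_steps=10) -> List[Subtask]:
    """Plan one subtask at a time until the planner signals completion."""
    executed: List[Subtask] = []
    for _ in range(max_steps):
        subtask = planner(instruction, scene, robot)
        if subtask[0] == "finish":
            break
        execute(subtask, scene, robot)
        executed.append(subtask)
    return executed


scene = SceneGraph(nodes={0: "table", 1: "cup", 2: "sink"}, edges=[(1, "on", 0)])
plan = run_task("put the cup in the sink", scene, RobotGraph(), toy_planner)
print(plan)
```

Because the planner is queried again after every executed subtask, each prediction is conditioned on the updated graphs rather than on a plan fixed in advance.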

Paper Structure (17 sections, 12 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Our network, GRID, leverages instructions, scene graphs, and robot graphs as inputs for robotic task planning. Both environmental knowledge and the robot’s state are densely represented through graphs. The robot iteratively updates the graphs and executes the subtasks planned by GRID until completing the entire task. GRID can be deployed to robots in different forms, operating effectively in various environments.
  • Figure 2: The architecture diagram of GRID. The instruction, robot graph, and scene graph are all transformed into tokens through INSTRUCTOR [su_one_2023]. Subsequently, GAT modules extract structural information from the graphs. The resulting tokens are strengthened by a feature enhancer and then fed into a task decoder, which outputs the action and the object in ID form.
  • Figure 3: Vanilla instruction and graph features are fed into N layers of parallel cross-attention to enhance crossover information between instructions and graphs.
  • Figure 4: The enhanced tokens are segregated into fusion features and graph queries, which are input into the transformer encoder and decoder, respectively. The tokens from the robot graph are mapped to scores for each action, and each token from the scene graph is converted to the score for the corresponding node.
  • Figure 5: Experimental system design in simulation.
  • ...and 5 more figures
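
The output stage described in the Figure 4 caption (action scores from the robot graph, one score per scene-graph node, outputs in ID form) can be illustrated with a small sketch. This is a simplified illustration under assumptions, not the paper's decoder: it assumes a single pooled robot token, plain linear scoring heads, and argmax selection. The function name `decode_subtask` and all weights and token values are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode_subtask(
    robot_token: Sequence[float],
    action_weights: List[Sequence[float]],
    scene_tokens: List[Sequence[float]],
    node_weight: Sequence[float],
) -> Tuple[int, int]:
    """Return (action id, object node id) from hypothetical linear heads."""
    # One score per pre-defined action, computed from the robot token.
    action_scores = [dot(robot_token, w) for w in action_weights]
    # One score per scene-graph node, computed from that node's token.
    node_scores = [dot(token, node_weight) for token in scene_tokens]
    return argmax(action_scores), argmax(node_scores)


# Made-up values: three candidate actions, three scene-graph nodes.
action_id, object_id = decode_subtask(
    robot_token=[1.0, 0.0],
    action_weights=[[0.2, 0.5], [0.9, 0.1], [-0.3, 0.4]],
    scene_tokens=[[0.1, 0.2], [0.3, 0.1], [0.4, 0.9]],
    node_weight=[1.0, 1.0],
)
print(action_id, object_id)
```

Scoring each scene-graph token separately means the number of candidate objects follows the number of nodes in the graph, so the same head can be applied to scenes of different sizes.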