SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs
Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di, Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam
TL;DR
SG-Bot addresses robotic object rearrangement by learning a coarse-to-fine goal imagination pipeline on scene graphs. It first extracts objects and builds a scene-graph-based coarse goal, then uses a Graph-to-3D model with shape priors to generate a fine goal scene $S^*$, followed by per-object point-cloud registration and occupancy-checked execution. The approach yields real-time, controllable planning without requiring predefined goal priors, and it outperforms state-of-the-art baselines in both simulation and real-world experiments. This work advances embodied AI by integrating commonsense reasoning with explicit geometric generation to guide robotic rearrangement robustly, and it demonstrates practical potential for flexible, interactive scene manipulation in cluttered environments.
Abstract
Object rearrangement is pivotal in robot-environment interaction and represents a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that uses a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot is lightweight, real-time, and user-controllable, combining commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure (observation, imagination, and execution) to address the task. During observation, objects are identified and extracted from a cluttered scene. During imagination, these objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. This scene graph then conditions a generative model, which produces a fine-grained goal scene that accounts for the shape information from the initial scene and the object semantics. Finally, during execution, the initial and imagined goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
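To make the matching step of the execution stage more concrete, the following is a minimal sketch of how a single object's initial point cloud could be rigidly aligned to its counterpart in the goal scene. It is an illustration only, not the registration method used by SG-Bot, which is not specified here. It assumes point-to-point correspondences are already known and uses the standard Kabsch (SVD-based) least-squares solution; the function name `rigid_align` and the synthetic data are hypothetical.

```python
import numpy as np


def rigid_align(source, target):
    """Least-squares rigid transform (R, t) such that R @ source_i + t ~ target_i.

    Hypothetical helper: assumes source and target are (N, 3) arrays whose
    rows are already in one-to-one correspondence (Kabsch algorithm).
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)

    # Cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)

    # Correct for a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])

    R = Vt.T @ D @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Synthetic example: an object rotated 30 degrees about z and shifted on the table.
rng = np.random.default_rng(0)
initial_pts = rng.normal(size=(50, 3))
angle = np.pi / 6
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.2, -0.1, 0.0])
goal_pts = initial_pts @ R_true.T + t_true

R_est, t_est = rigid_align(initial_pts, goal_pts)
```

In this noise-free setting the estimated pose matches the true one up to floating-point error. In practice the generated goal shape would only approximate the observed object, so correspondences would have to be estimated as well, for example with an iterative scheme such as ICP.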
