SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs

Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di, Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam

TL;DR

SG-Bot addresses robotic object rearrangement by learning a goal imagination pipeline on scene graphs in a coarse-to-fine manner. It first extracts objects, builds a scene-graph-based coarse goal, and then uses a Graph-to-3D model with shape priors to generate a fine goal scene $S^*$, followed by per-object point-cloud registration and occupancy-checked execution. The approach yields real-time, controllable planning without requiring predefined goal priors and shows superior performance in both simulation and real-world experiments compared to state-of-the-art baselines. This work advances embodied AI by integrating commonsense reasoning with explicit geometric generation to robustly guide robotic rearrangement. SG-Bot demonstrates practical potential for flexible, interactive scene manipulation in cluttered environments.
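
The per-object registration step recovers the rigid motion that carries an object from its initial pose to its imagined goal pose. The sketch below is a deliberately simplified planar illustration, not the registration method used by SG-Bot: it assumes 2D points on the tabletop and known point-to-point correspondences (`src[i]` matches `dst[i]`), and the function name `register_planar` is hypothetical.

```python
import math

def register_planar(src, dst):
    """Least-squares rigid transform (theta, tx, ty) mapping src onto dst.

    Assumption: src[i] corresponds to dst[i] and both are 2D (x, y) points.
    """
    n = len(src)
    # Centroids of both point sets.
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    # Accumulate dot and cross products of the centred correspondences.
    dot = 0.0
    cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - csx, sy - csy
        bx, by = dx - cdx, dy - cdy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # Rotation angle that best aligns the centred sets.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the goal centroid.
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty

# Usage: an object rotated by 90 degrees and shifted by (2, 3).
initial = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
goal = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
theta, tx, ty = register_planar(initial, goal)
```

A full 3D version would replace the closed-form angle with an SVD-based alignment and would have to estimate correspondences as well, since the imagined goal shape is generated rather than observed.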

Abstract

Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics, seamlessly blending the consideration of commonsense knowledge with automatic generation capabilities. SG-Bot employs a three-fold procedure--observation, imagination, and execution--to adeptly address the task. Initially, objects are discerned and extracted from a cluttered scene during the observation. These objects are first coarsely organized and depicted within a scene graph, guided by either commonsense or user-defined criteria. Then, this scene graph subsequently informs a generative model, which forms a fine-grained goal scene considering the shape information from the initial scene and object semantics. Finally, for execution, the initial and envisioned goal scenes are matched to formulate robotic action policies. Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
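
As a rough illustration of the coarse stage, the sketch below builds a scene graph by looking up a relation for every ordered pair of detected object classes. The rule table, the relation names, and the function `build_scene_graph` are assumptions made for this example; the source only states that edges follow commonsense or user-defined criteria.

```python
from itertools import permutations

# Hypothetical commonsense rules: (subject class, object class) -> relation.
COMMONSENSE_RULES = {
    ("fork", "plate"): "left of",
    ("knife", "plate"): "right of",
    ("cup", "plate"): "behind",
}

def build_scene_graph(labels, rules=COMMONSENSE_RULES):
    """Return nodes and (subject_idx, relation, object_idx) edges.

    Every ordered pair of nodes is checked against the rule table;
    pairs without a matching rule get no edge.
    """
    edges = []
    for (i, a), (j, b) in permutations(enumerate(labels), 2):
        relation = rules.get((a, b))
        if relation is not None:
            edges.append((i, relation, j))
    return {"nodes": list(labels), "edges": edges}

# Usage: commonsense goal for a small table setting.
graph = build_scene_graph(["plate", "fork", "knife"])

# A user-defined goal simply swaps in a different rule table.
custom = build_scene_graph(["plate", "fork"], {("fork", "plate"): "right of"})
```

Keeping the rules in a plain lookup table is what makes the goal controllable in this sketch: changing the arrangement means editing the table, not retraining anything.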

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 2: SG-Bot pipeline. a) SG-Bot segments the input RGB image via Mask R-CNN [he2017mask] to obtain individual object nodes $v_i$. Then, the corresponding point cloud of $v_i$ is obtained by back-projecting the depth map with camera intrinsics $K$. b) Coarse: the graph constructor connects each pair of nodes according to commonsense or user-defined rules, yielding scene graph $\mathcal{G}$. Fine: $\mathcal{G}$ is embedded and enhanced to $\mathcal{G}_z^\beta$ by combining estimated shape priors $\beta^*$, extracted from the normalized point clouds using the trained encoder $\mathcal{B}_E$, and latent code $z$ sampled from the learned layout-shape distribution. $\mathcal{G}_z^\beta$ then informs $\mathit{\Phi}_D$ and $\mathcal{L}_D$ of Graph-to-3D [dhamo2021graph] to generate shape codes $\alpha^*$ and the scene layout, respectively. $\alpha^*$ are decoded as shapes via $\mathcal{A}_D$, which are then populated in the layouts to form the goal scene. c) SG-Bot matches the initial and envisioned goal using point cloud registration and performs an occupancy check to determine the final movement in each step, as illustrated in \ref{matching}. The robot iteratively executes the action, transforming scenes into intermediate states and updating the observation until it reaches the goal state.
  • Figure 3: Modular Training. a) $\mathcal{A}_E,\mathcal{A}_D$ are trained using full shapes in the canonical view to obtain the shape code $\alpha$, while $\mathcal{B}_E,\mathcal{B}_D$ are trained on partial shapes in the initial scenes under the camera view to obtain the shape priors $\beta$. $\mathcal{A}_D$ and $\mathcal{B}_E$ are retained during inference. b) A scene graph with textual information is processed through embedding layers $\mathcal{M}_O, \mathcal{M}_\Gamma$ to obtain implicit class features $c_i,c_{i \to j}$ on each node and edge. c) For training Graph-to-3D on goal scenes, the processed scene graph is first concatenated with $\alpha$ and bounding box parameters $B$ on the shape branch $\mathit{\Phi}(\mathit{\Phi}_E,\mathit{\Phi}_D)$ and layout branch $\mathcal{L}(\mathcal{L}_E,\mathcal{L}_D)$, respectively. $\mathit{\Phi}$ and $\mathcal{L}$ jointly model the layout-shape distribution $Z$ [dhamo2021graph]. This model incorporates $\beta$ from initial scenes to create $\mathcal{G}_z^\beta$, subsequently estimating $\hat{\alpha}$ and $\hat{B}$. Modules in b) and c) are jointly trained, with $\mathcal{M}_O, \mathcal{M}_\Gamma$, $\mathit{\Phi}_D$ and $\mathcal{L}_D$ used during inference.
  • Figure 4: Visualization results in simulation. We compare SG-Bot with state-of-the-art methods StructFormer [liu2022structformer] and Socratic Models [zeng2022socratic]. We highlight the superiority of SG-Bot via rectangles.
  • Figure 5: Real-world experiment. a) We tested unseen cross-category objects with a physical manipulator. b) Action decomposition of one trial during the rearrangement.
  • Figure 6: Functional shape priors. Without shape priors, SG-Bot-dummy generates inconsistent shapes (left). SG-Bot controls the generated shapes close to the ground truth (right) with the help of initial shape priors (middle).
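
The back-projection mentioned in the Figure 2 caption can be sketched with the standard pinhole camera model. This is a minimal illustration under stated assumptions: the depth map and the instance mask are nested Python lists of equal size, depth is metric with zero meaning "no reading", and $K$ is given by its entries `fx`, `fy`, `cx`, `cy`. The function name `back_project` is hypothetical.

```python
def back_project(depth, mask, fx, fy, cx, cy):
    """Lift the masked pixels of a depth map to 3D camera-frame points.

    depth[v][u] is the depth at pixel column u, row v; mask[v][u] is True
    for pixels that belong to the object. Pixels with depth <= 0 are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask[v][u] and z > 0:
                # Invert the pinhole projection u = fx * x / z + cx (same for v).
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points

# Usage: a 2x2 depth map with one invalid reading and one masked-out pixel.
depth = [[0.0, 2.0],
         [1.0, 3.0]]
mask = [[True, True],
        [True, False]]
cloud = back_project(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Running this once per segmented instance yields one point cloud per object node $v_i$; a practical implementation would vectorize the loops with an array library.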