Articulated 3D Scene Graphs for Open-World Mobile Manipulation

Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, Abhinav Valada

TL;DR

MoMa-SG addresses the gap between semantic understanding and kinematic prediction for open-world mobile manipulation by building semantic-kinematic 3D scene graphs from in-the-wild RGB-D observations. The pipeline segments interactions, estimates twist-based articulation models in $SE(3)$ with a regularization that disambiguates revolute and prismatic joints, and constructs a hierarchical graph linking articulated parents with contained objects. A new Arti4D-Semantic dataset provides real-world, open-world articulated scenes with per-object articulation axes, contained-object labels, and multiple observation paradigms. Real-world experiments on two mobile manipulators demonstrate robust manipulation guided by the semantic-kinematic graphs, and code/data are released to enable broader adoption.

Abstract

Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: https://momasg.cs.uni-freiburg.de.
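To make the twist representation mentioned above concrete, the following minimal sketch shows why a single twist in $SE(3)$ can describe both joint types: with a zero angular part the exponential map reduces to a pure translation (prismatic), and with a unit angular part it yields a rotation about a fixed axis (revolute). This is the standard textbook exponential map, not the estimator proposed in the paper; the function names `hat`, `revolute_twist`, and `twist_to_transform` are hypothetical and chosen only for illustration.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ p equals np.cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def revolute_twist(axis, point_on_axis):
    """Twist (omega, v) of a rotation about `axis` passing through `point_on_axis`."""
    omega = np.asarray(axis, dtype=float)
    omega = omega / np.linalg.norm(omega)
    v = -np.cross(omega, np.asarray(point_on_axis, dtype=float))
    return omega, v

def twist_to_transform(omega, v, theta):
    """Exponential map of the twist (omega, v) scaled by the joint state theta.

    Assumption for this sketch: omega is either a unit vector (revolute)
    or the zero vector (prismatic).
    """
    omega = np.asarray(omega, dtype=float)
    v = np.asarray(v, dtype=float)
    T = np.eye(4)
    if np.linalg.norm(omega) < 1e-9:
        # Prismatic case: pure translation along v.
        T[:3, 3] = v * theta
        return T
    W = hat(omega)
    W2 = W @ W
    # Rodrigues' formula for the rotation part.
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * W2
    # Translation part induced by rotating about an axis that is off the origin.
    G = theta * np.eye(3) + (1.0 - np.cos(theta)) * W + (theta - np.sin(theta)) * W2
    T[:3, :3] = R
    T[:3, 3] = G @ v
    return T

# A door-like hinge: vertical axis through (1, 0, 0), opened by 90 degrees.
omega, v = revolute_twist([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
T_door = twist_to_transform(omega, v, np.pi / 2)
handle = T_door @ np.array([2.0, 0.0, 0.0, 1.0])

# A drawer-like slider: 0.3 units along x, no rotation.
T_drawer = twist_to_transform([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0.3)
```

Because both cases share one parameterization, an estimator can in principle fit the six twist parameters first and decide between the joint types afterwards. How the paper's regularization resolves that ambiguity is not shown here.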

Paper Structure (34 sections, 23 equations, 18 figures, 11 tables, 1 algorithm)

Figures (18)

  • Figure 1: MoMa-SG enables the construction of accurate 3D scene graphs over articulated scenes and serves as a backbone for long-horizon mobile manipulation.
  • Figure 2: MoMa-SG enables the construction of accurate 3D scene graphs over articulated scenes and serves as a backbone for long-horizon mobile manipulation. We first discover interaction segments (\ref{sec:interaction_disc}), then attain object articulation models $\mathcal{A}$ by estimating twists from point trajectories (\ref{sec:articulation_estimation}). Next, we match mapped objects $\mathcal{O}$ against articulations and discover objects contained in respective articulated parents (\ref{sec:articulated_scene_graph}).
  • Figure 3: Labels contained in Arti4D-Semantic: Solid-circled labels denote articulated parent parts and dashed-circled labels represent articulated labels.
  • Figure 4: Contained objects discovered using MoMa-SG across different scenes of Arti4D-Semantic.
  • Figure 5: Qualitative results of MoMa-SG on Arti4D-Semantic: Estimated axis positions and corresponding object masks. As demonstrated, we observe minimal errors for a large number of prismatic objects and small errors on revolute objects.
  • ...and 13 more figures