Articulated 3D Scene Graphs for Open-World Mobile Manipulation

Martin Büchner; Adrian Röfer; Tim Engelbracht; Tim Welschehold; Zuria Bauer; Hermann Blum; Marc Pollefeys; Abhinav Valada

TL;DR

MoMa-SG addresses the gap between semantic understanding and kinematic prediction for open-world mobile manipulation by building semantic-kinematic 3D scene graphs from in-the-wild RGB-D observations. The pipeline segments interactions, estimates twist-based articulation models in $SE(3)$ with a regularization that disambiguates revolute and prismatic joints, and constructs a hierarchical graph linking articulated parents with contained objects. A new Arti4D-Semantic dataset provides real-world, open-world articulated scenes with per-object articulation axes, contained-object labels, and multiple observation paradigms. Real-world experiments on two mobile manipulators demonstrate robust manipulation guided by the semantic-kinematic graphs, and code/data are released to enable broader adoption.
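As an illustration of the hierarchical graph described above, the following minimal sketch links articulated parents to the objects they contain. All class, field, and function names (`ArticulatedObject`, `ContainedObject`, `find_parent`) are hypothetical and are not taken from the MoMa-SG code release; the actual representation may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ContainedObject:
    """An object found inside an articulated parent (hypothetical type)."""
    label: str


@dataclass
class ArticulatedObject:
    """An articulated parent node with its joint model (hypothetical type)."""
    label: str
    joint_type: str                      # assumed to be "revolute" or "prismatic"
    axis: Tuple[float, float, float]     # joint axis direction
    contained: List[ContainedObject] = field(default_factory=list)


def find_parent(graph: List[ArticulatedObject], label: str) -> Optional[ArticulatedObject]:
    """Return the articulated parent that contains an object with this label, if any."""
    for parent in graph:
        if any(child.label == label for child in parent.contained):
            return parent
    return None


# Example: a drawer (prismatic) containing a mug, and an empty cabinet door (revolute).
drawer = ArticulatedObject("drawer", "prismatic", (1.0, 0.0, 0.0))
drawer.contained.append(ContainedObject("mug"))
door = ArticulatedObject("cabinet door", "revolute", (0.0, 0.0, 1.0))
graph = [drawer, door]
```

A query such as `find_parent(graph, "mug")` then tells a planner which parent must be opened, and with which joint type, before the contained object can be reached.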

Abstract

Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: https://momasg.cs.uni-freiburg.de.
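To make the twist-based articulation model more concrete, the sketch below shows how a single twist can encode either joint type using standard screw theory. It is not the paper's estimation procedure (which fits the twist from 3D point trajectories in a regularized optimization); it only interprets an already-estimated twist. The `(v, ω)` ordering, the function name `interpret_twist`, and the threshold `eps` are assumptions made for illustration.

```python
import numpy as np


def interpret_twist(twist, eps=1e-6):
    """Interpret a twist (v, w) as a prismatic or revolute joint.

    Assumes the convention v = -w x q for a pure rotation about an axis
    through point q, and w = 0 for a pure translation along v.
    """
    twist = np.asarray(twist, dtype=float)
    v, w = twist[:3], twist[3:]
    w_norm = np.linalg.norm(w)
    v_norm = np.linalg.norm(v)

    if w_norm < eps:
        if v_norm < eps:
            raise ValueError("zero twist: no motion to interpret")
        # No rotational part: translation along v.
        return {"type": "prismatic", "axis": v / v_norm, "point": None}

    # Rotational part present: the axis direction is w, and the point on
    # the axis closest to the origin is (w x v) / |w|^2.
    axis = w / w_norm
    point = np.cross(w, v) / w_norm**2
    return {"type": "revolute", "axis": axis, "point": point}


# Rotation about the z-axis through q = (1, 0, 0): v = -w x q = (0, -1, 0).
hinge = interpret_twist([0.0, -1.0, 0.0, 0.0, 0.0, 1.0])

# Pure translation along z, e.g. a drawer sliding along its rail.
slider = interpret_twist([0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
```

Because both joint types live in the same six-dimensional parameterization, a single optimization can fit either one; in this sketch the distinction reduces to whether the rotational part is close to zero.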

Paper Structure (34 sections, 23 equations, 18 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Approach
Interaction Discovery
Articulation Estimation
Articulated 3D Scene Graph Construction
Arti4D-Semantic Dataset
Experimental Evaluation
Interaction Segmentation
Articulation Estimation
Object Understanding
Real-World Mobile Manipulation
Limitations
Conclusion
Additional Details on MoMa-SG
...and 19 more sections

Figures (18)

Figure 1: MoMa-SG enables the construction of accurate 3D scene graphs over articulated scenes and serves as a backbone for long-horizon mobile manipulation.
Figure 2: MoMa-SG enables the construction of accurate 3D scene graphs over articulated scenes and serves as a backbone for long-horizon mobile manipulation. We first discover interaction segments (see Interaction Discovery), then attain object articulation models $\mathcal{A}$ by estimating twists from point trajectories (see Articulation Estimation). Next, we match mapped objects $\mathcal{O}$ against articulations and discover objects contained in respective articulated parents (see Articulated 3D Scene Graph Construction).
Figure 3: Labels contained in Arti4D-Semantic: Solid-circled labels denote articulated parent parts and dashed-circled labels represent articulated labels.
Figure 4: Contained objects discovered using MoMa-SG across different scenes of Arti4D-Semantic.
Figure 5: Qualitative results of MoMa-SG on Arti4D-Semantic: Estimated axis positions and corresponding object masks. As demonstrated, we observe minimal errors for a large number of prismatic objects and small errors on revolute objects.
...and 13 more figures
