3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans

Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, Luca Carlone

TL;DR

3D Dynamic Scene Graphs (DSGs) unify geometry, semantics, and dynamics into a layered scene representation suitable for planning and decision-making. The authors present SPIN, an automatic pipeline that builds DSGs from visual-inertial data, integrating object and dense human mesh detection with place/room parsing. They demonstrate the system in a photo-realistic Unity simulator, showing robustness in crowded scenes and accurate parsing of humans, objects, places, and rooms. The work enables actionable planning, human-robot interaction, long-term autonomy, and scene prediction by providing hierarchical, time-aware scene representations.

Abstract

We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs. Scene graphs are directed graphs where nodes represent entities in the scene (e.g. objects, walls, rooms), and edges represent relations (e.g. inclusion, adjacency) among nodes. Dynamic scene graphs (DSGs) extend this notion to represent dynamic scenes with moving agents (e.g. humans, robots), and to include actionable information that supports planning and decision-making (e.g. spatio-temporal relations, topology at different levels of abstraction). Our second contribution is to provide the first fully automatic Spatial PerceptIon eNgine (SPIN) to build a DSG from visual-inertial data. We integrate state-of-the-art techniques for object and human detection and pose estimation, and we describe how to robustly infer object, robot, and human nodes in crowded scenes. To the best of our knowledge, this is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking. Moreover, we provide algorithms to obtain hierarchical representations of indoor environments (e.g. places, structures, rooms) and their relations. Our third contribution is to demonstrate the proposed spatial perception engine in a photo-realistic Unity-based simulator, where we assess its robustness and expressiveness. Finally, we discuss the implications of our proposal on modern robotics applications. 3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction. A video abstract is available at https://youtu.be/SWbofjhyPzI
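
To make the graph definition in the abstract concrete, the following is a minimal Python sketch of a layered directed graph with time-stamped relations. It is an illustration under stated assumptions, not the authors' SPIN implementation: the class names (`Node`, `Edge`, `DynamicSceneGraph`), the layer labels, the relation strings, and the `room_of` query are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    layer: str  # hypothetical layer labels: "room", "place", "object", "agent"


@dataclass
class Edge:
    source: str
    relation: str                 # hypothetical relation names: "in", "adjacent_to"
    target: str
    time: Optional[float] = None  # None marks a relation with no time stamp


@dataclass
class DynamicSceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: str, layer: str) -> None:
        self.nodes[node_id] = Node(node_id, layer)

    def add_edge(self, source: str, relation: str, target: str,
                 time: Optional[float] = None) -> None:
        # Reject edges whose endpoints were never added as nodes.
        for endpoint in (source, target):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint}")
        self.edges.append(Edge(source, relation, target, time))

    def room_of(self, agent_id: str, time: float) -> Optional[str]:
        """Return the room linked to the agent by an "in" edge stamped with
        exactly this time, or None if no such edge exists."""
        for e in self.edges:
            if (e.source == agent_id and e.relation == "in"
                    and e.time == time
                    and self.nodes[e.target].layer == "room"):
                return e.target
        return None


# Illustrative data only: two adjacent rooms and one moving agent.
graph = DynamicSceneGraph()
graph.add_node("room_B", "room")
graph.add_node("room_C", "room")
graph.add_node("agent_A", "agent")
graph.add_edge("room_B", "adjacent_to", "room_C")
graph.add_edge("agent_A", "in", "room_B", time=0.0)
graph.add_edge("agent_A", "in", "room_C", time=1.0)
print(graph.room_of("agent_A", 0.0))
```

Storing the time stamp on the edge rather than on the node keeps static relations (room adjacency) and dynamic ones (an agent's room at a given time) in the same edge list, which is one simple way to express a spatio-temporal relation such as "agent A is in room B at time t".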

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We propose 3D Dynamic Scene Graphs (DSGs) as a unified representation for actionable spatial perception. (a) A DSG is a layered and hierarchical representation that abstracts a dense 3D model (e.g., a metric-semantic mesh) into higher-level spatial concepts (e.g., objects, agents, places, rooms) and models their spatio-temporal relations (e.g., "agent A is in room B at time $t$", traversability between places or rooms). We present a Spatial PerceptIon eNgine (SPIN) that reconstructs a DSG from visual-inertial data, and (a) segments places, structures (e.g., walls), and rooms, (b) is robust to extremely crowded environments, (c) tracks dense mesh models of human agents in real time, (d) estimates centroids and bounding boxes of objects of unknown shape, (e) estimates the 3D pose of objects for which a CAD model is given.
  • Figure 2: Places and their connectivity shown as a graph. (a) Skeleton (places and topology) produced by the method of Oleynikova et al. (IROS 2018) (side view); (b) Room parsing produced by our approach (top-down view); (c) Zoomed-in view; red edges connect different rooms.
  • Figure 3: Structures: exploded view of walls and floor.
  • Figure 4: Human nodes: (a) Input camera image from Unity, (b) SMPL mesh detection and pose/shape estimation using the method of Kolotouros et al. (CVPR 2019), (c) Temporal tracking and consistency checking on the maximum joint displacement between detections.
  • Figure 5: 3D mesh reconstruction (a) without and (b) with dynamic masking.
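
The consistency check named in the Figure 4(c) caption can be sketched as follows. This is only a plausible reading of "maximum joint displacement between detections": the function names, the speed-based threshold, and the default value of `max_speed` are assumptions for illustration, not values or criteria taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def max_joint_displacement(prev: Sequence[Point], curr: Sequence[Point]) -> float:
    """Largest Euclidean distance moved by any single joint between two detections."""
    if len(prev) != len(curr) or not prev:
        raise ValueError("detections must have the same non-zero number of joints")
    return max(math.dist(p, q) for p, q in zip(prev, curr))


def is_consistent(prev: Sequence[Point], curr: Sequence[Point],
                  dt: float, max_speed: float = 3.0) -> bool:
    """Accept the new detection only if no joint moved faster than max_speed.

    max_speed (metres per second) is a hypothetical threshold chosen for
    this sketch; dt is the time in seconds between the two detections.
    """
    if dt <= 0:
        return False
    return max_joint_displacement(prev, curr) <= max_speed * dt


# Illustrative data only: three joints, detections 0.1 s apart.
previous = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.7)]
small_move = [(0.1, 0.0, 0.0), (0.1, 0.0, 1.0), (0.1, 0.0, 1.7)]
jump = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0), (0.0, 0.0, 1.7)]
print(is_consistent(previous, small_move, dt=0.1))
print(is_consistent(previous, jump, dt=0.1))
```

Checking the maximum over joints rather than the mean means a single implausible joint jump is enough to flag a detection, which suits the goal of rejecting mismatched or spurious human detections in crowded scenes.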

Theorems & Definitions (2)

  • Remark 1: Planning Queries
  • Remark 2: Composition of DSGs