
Pandora: Articulated 3D Scene Graphs from Egocentric Vision

Alan Yu, Yun Chang, Christopher Xie, Luca Carlone

Abstract

Robotic mapping systems typically build metric-semantic scene representations from the robot's own sensors and cameras. However, these "first person" maps inherit the limitations of the robot's embodiment and skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is incomplete and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene while wearing Project Aria glasses, providing a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, using simple heuristics, we can leverage egocentric data to recover models of articulate object parts, with quality comparable to that of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. Finally, we demonstrate that these articulated 3D scene graphs enhance a robot's ability to perform mobile manipulation tasks, showcasing an application in which a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.

Figures (5)

  • Figure 1: Pandora constructs an articulated 3D scene graph from egocentric data, where we model articulate object parts and their relationships to objects they contain. The estimated articulation models from human interactions are used to build a 3D scene graph, which can then be used for downstream object-retrieval tasks on a mobile robot.
  • Figure 2: Pandora models the Geometric Layer, represented by a 3D voxel grid, and the Object Layer. Objects can be either (1) articulate parts or (2) ordinary objects. Articulate parts (e.g. the fridge door) can constrain the movement of ordinary objects (e.g. the soap bottle and milk carton on the door itself), or only contain them (e.g. the grapes that lie behind the door). A minimal data-structure sketch of this two-layer representation follows this list.
  • Figure 3: (left) For each timestep $t$, we intersect the hand-sphere with the scene point cloud (back-projected from the depth map $\mathbf{D}_t$). The counts form a time series, which we post-process to obtain the interaction interval, whose endpoints define the interaction keyframes $\in \mathcal{I}$. The static keyframes $\in \mathcal{K}$, where there is no interaction, span the intervals in red. (right) By fusing the scene mesh over the static keyframe intervals, we obtain the mesh before and after interaction. The object part is then extracted and used in combination with the hand poses to estimate the articulation model. A minimal sketch of the counting heuristic follows this list.
  • Figure 4: Pandora is queried to provide the handle grasp point and orientation, along with target end-effector positions. With knowledge of how the constrained object has moved, a target grasp location is proposed and executed.
  • Figure 5: (a) Sample renders from a simulated evaluation scene. The ground truth instance map is used for evaluations for consistency with baselines. (b) Examples of articulated objects in the evaluation scene.
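The Figure 2 caption describes a two-layer representation: a Geometric Layer stored as a 3D voxel grid, and an Object Layer whose nodes are articulate parts or ordinary objects linked by constrains/contains relationships. The following is a minimal Python sketch of such a structure under those assumptions; all class and field names here are illustrative, not the paper's actual interface.

```python
# Minimal sketch of the two-layer scene representation from the Figure 2
# caption. Class and field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class ArticulatePart:
    """An articulate part (e.g. a fridge door) with its articulation model."""
    name: str
    joint_type: str              # "revolute" or "prismatic"
    axis_origin: np.ndarray      # point on the joint axis
    axis_direction: np.ndarray   # unit vector along the joint axis
    constrains: List[str] = field(default_factory=list)  # objects that move with the part
    contains: List[str] = field(default_factory=list)    # objects merely enclosed by it


@dataclass
class SceneGraph:
    """Articulated 3D scene graph: Geometric Layer + Object Layer."""
    voxel_grid: np.ndarray                                          # Geometric Layer (occupancy voxels)
    objects: Dict[str, np.ndarray] = field(default_factory=dict)    # ordinary object name -> centroid
    parts: Dict[str, ArticulatePart] = field(default_factory=dict)  # articulate parts by name

    def containers_of(self, obj_name: str) -> List[str]:
        """Articulate parts that constrain or contain a given object."""
        return [p.name for p in self.parts.values()
                if obj_name in p.constrains or obj_name in p.contains]
```

In this sketch, answering "where is the milk carton?" reduces to calling `containers_of("milk carton")`, which would return the fridge door node and, through its articulation model, how to open it.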
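The Figure 3 caption outlines a hand-sphere counting heuristic for detecting interaction intervals. Below is a minimal sketch of that idea; the sphere radius, smoothing window, and count threshold are illustrative assumptions rather than the paper's parameters, and the per-frame point clouds are assumed to be already back-projected from the depth maps.

```python
# Minimal sketch of the interaction-detection heuristic described in the
# Figure 3 caption. Radius, window, and threshold values are assumptions.
import numpy as np


def hand_sphere_counts(scene_points, hand_centers, radius=0.10):
    """Count scene points inside a sphere around the hand at each timestep.

    scene_points : list of (N_t, 3) arrays, back-projected from depth map D_t
    hand_centers : (T, 3) array of hand positions per timestep
    """
    counts = []
    for pts, center in zip(scene_points, hand_centers):
        dists = np.linalg.norm(pts - center, axis=1)
        counts.append(int((dists < radius).sum()))
    return np.asarray(counts)


def interaction_interval(counts, window=5, threshold=50):
    """Smooth the count time series and return (start, end) frame indices
    bounding the frames above threshold; None if no interaction occurs."""
    kernel = np.ones(window) / window
    smooth = np.convolve(counts, kernel, mode="same")
    active = np.flatnonzero(smooth > threshold)
    if active.size == 0:
        return None
    # The endpoints define the interaction keyframes in I; frames outside
    # the interval are the static keyframes in K.
    return int(active[0]), int(active[-1])
```

The static keyframes on either side of the returned interval are then used to fuse the pre- and post-interaction meshes, from which the moved object part is extracted.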