Lost & Found: Tracking Changes from Egocentric Observations in 3D Dynamic Scene Graphs

Tjark Behrens, René Zurbrügg, Marc Pollefeys, Zuria Bauer, Hermann Blum

TL;DR

Lost & Found tackles dynamic scene understanding from egocentric observations by updating a transformable 3D scene graph that encodes object‑level relations as changes occur. It fuses a static prior geometry with online $6\mathrm{DoF}$ pose tracking of hand–object interactions, using hand positions, 3D object priors, and 2D hand–object cues to estimate poses in $SE(3)$ during interaction intervals. The key contributions are (i) a dynamic scene graph with relations like 'contains' and 'close to', (ii) an online two‑stage tracking method that detects interaction intervals and estimates poses without depth, and (iii) demonstrations in real rooms showing robustness to egocentric viewpoints and missing depth, including robotic teach‑and‑repeat and drawer‑retrieval tasks. The results show substantial improvements over baselines in translation and rotation accuracy and yield smoother trajectories, enabling downstream robotic capabilities that static maps cannot support.

Abstract

Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective at tracking, e.g., objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in robotic applications that would otherwise be infeasible: we show how our method allows commanding a mobile manipulator through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer. Code, videos and corresponding data are accessible at https://behretj.github.io/LostAndFound.
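
To make the idea of a transformable scene graph with object-level relations concrete, here is a minimal sketch. It is not the authors' data structure: the class names, the axis-aligned containment test, the treatment of a moved object as a single point, and the 0.5 m closeness threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
import math

Vec3 = tuple[float, float, float]
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def mat_vec(R, p):
    """Multiply a 3x3 rotation matrix with a 3-vector."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) for i in range(3))


def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


@dataclass
class Node:
    """Hypothetical object node: a pose plus an axis-aligned box."""
    name: str
    position: Vec3
    half_extent: Vec3 = (0.0, 0.0, 0.0)
    rotation: tuple = IDENTITY

    def encloses(self, point):
        # Assumption: the box stays axis-aligned; the node rotation is ignored here.
        return all(
            abs(point[i] - self.position[i]) <= self.half_extent[i]
            for i in range(3)
        )


@dataclass
class SceneGraph:
    close_threshold: float = 0.5  # metres, hypothetical value
    nodes: dict = field(default_factory=dict)

    def add(self, node):
        self.nodes[node.name] = node

    def apply_transform(self, name, R, t):
        """Apply the rigid motion p' = R p + t to one node."""
        node = self.nodes[name]
        moved = mat_vec(R, node.position)
        node.position = tuple(moved[i] + t[i] for i in range(3))
        node.rotation = mat_mul(R, node.rotation)

    def relations(self, name):
        """Recompute 'contains' and 'close to' edges for one object."""
        obj = self.nodes[name]
        found = []
        for other in self.nodes.values():
            if other.name == name:
                continue
            if other.encloses(obj.position):
                found.append((other.name, "contains", name))
            elif math.dist(obj.position, other.position) <= self.close_threshold:
                found.append((name, "close to", other.name))
        return found


graph = SceneGraph()
graph.add(Node("drawer", (0.0, 0.0, 0.5), (0.2, 0.2, 0.1)))
graph.add(Node("shelf", (2.0, 0.0, 1.0), (0.4, 0.2, 0.02)))
graph.add(Node("cow toy", (2.0, 0.0, 1.1)))

before = graph.relations("cow toy")  # toy rests just above the shelf board

# A tracked interaction: rotate 90 degrees about z, then translate into the drawer.
ROT_Z_90 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
graph.apply_transform("cow toy", ROT_Z_90, (0.0, -2.0, -0.6))

after = graph.relations("cow toy")  # toy centre now lies inside the drawer box
```

In this toy setup the relation changes from `("cow toy", "close to", "shelf")` to `("drawer", "contains", "cow toy")`. That is the kind of information a static map cannot represent once the object is hidden from view.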

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Object tracking: Our method tracks the dynamic information associated with the pick-and-place action of the cow toy (purple). The figure presents a series of snapshots of the employed scene graph data structure at varying time points. The object trajectory extends from the right shelf to the top-left drawer of the cabinet, as indicated by the dotted arrow.
  • Figure 2: Method Overview: We build a static scene graph that captures object-level relationships, given our initial 3D scan with its semantic instance segmentation. Each Aria glasses recording provides hand positions and device poses. With Lost & Found, we identify object interactions by locating them in our 3D prior and simultaneously querying a 2D hand-object tracker. At the beginning of such an interaction, we project 3D points of our object instance onto the image plane. A point tracking method keeps track of these 2D feature points in subsequent observations. While the 3D hand location yields an anchor for the object translation, we can apply a robust perspective-n-point algorithm to the known 2D-3D correspondences for each RGB image, to identify the correct 6DoF pose of the object. The scene graph is updated accordingly to reflect the correct state of the current environment. In the example above, the picture frame (red) is carried from the rack on the right to the top of the tall shelf on the left.
  • Figure 3: Teach & Repeat Experiment. We showcase how Lost & Found can help to record recurring motion primitives. In this example (from left to right), a human agent opens the top-left drawer of the small cabinet and grabs the blue toy from the other side of the room. The toy is then stored inside the drawer. To conclude the action, the drawer is closed again. We demonstrate that our method can be seamlessly integrated into robotic systems that are then capable of replaying the tracked interaction (on the right).
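
The Figure 2 caption describes projecting 3D object points onto the image plane and recovering a 6DoF pose from the resulting 2D-3D correspondences. The sketch below illustrates that geometric core on synthetic data. It is a simplification and not the paper's pipeline: it uses a plain Direct Linear Transform (DLT) instead of a robust perspective-n-point solver, so it has no outlier rejection, no point tracker, and no hand-position anchor. The function names, the camera intrinsics, and the test pose are all assumptions.

```python
import numpy as np


def project(points_3d, R, t, K):
    """Pinhole projection of Nx3 object points to Nx2 pixel coordinates."""
    cam = points_3d @ R.T + t          # object frame -> camera frame
    uvw = cam @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division


def pnp_dlt(points_3d, points_2d, K):
    """Estimate (R, t) from >= 6 non-coplanar 2D-3D correspondences via DLT.

    Plain least squares: a robust variant would wrap this in an outlier
    rejection scheme such as RANSAC.
    """
    n = len(points_3d)
    pix_h = np.hstack([points_2d, np.ones((n, 1))])
    norm = pix_h @ np.linalg.inv(K).T  # normalised image coordinates
    X_h = np.hstack([points_3d, np.ones((n, 1))])
    zeros = np.zeros_like(X_h)

    # Each correspondence gives two linear equations in the 12 entries of [R|t].
    rows_u = np.hstack([X_h, zeros, -norm[:, 0:1] * X_h])
    rows_v = np.hstack([zeros, X_h, -norm[:, 1:2] * X_h])
    A = np.vstack([rows_u, rows_v])

    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Remove the unknown scale and project the 3x3 block onto a rotation.
    U, S, Vt_r = np.linalg.svd(P[:, :3])
    R = U @ Vt_r
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:           # resolve the sign ambiguity
        R, t = -R, -t
    return R, t


# Synthetic check with an assumed camera and an assumed ground-truth pose.
rng = np.random.default_rng(0)
object_points = rng.uniform(-0.1, 0.1, size=(8, 3))   # metres, object frame
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.05, -0.02, 0.6])

pixels = project(object_points, R_true, t_true, K)
R_est, t_est = pnp_dlt(object_points, pixels, K)
reprojection_error = np.abs(project(object_points, R_est, t_est, K) - pixels).max()
```

With noise-free correspondences the estimate should match the ground-truth pose up to numerical precision. With real tracked feature points, which drift and contain outliers, a robust solver is the more appropriate choice, as the caption indicates.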