Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen

TL;DR

The paper defines the Out of Sight, Not Out of Mind (OSNOM) task for egocentric video and proposes Lift, Match, and Keep (LMK), which lifts 2D observations to 3D, tracks objects using appearance and spatial cues, and preserves their 3D locations even when they are out of view. On EPIC-KITCHENS, keeping 3D object locations gives more reliable long-range localization than 2D tracking and a recent 3D baseline: in one long video with 50 active objects, LMK correctly locates 57% of them after 120 seconds, against 33% for the 3D baseline and 17% for a general 2D tracker. The study highlights the importance of object permanence and 3D world-coordinate tracking for spatial cognition in real-world settings and lays groundwork for assistive systems that reason about objects beyond current visibility.

Abstract

As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of their sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce a simple but effective approach to address this challenging problem, called Lift, Match, and Keep (LMK). LMK lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera. We benchmark LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. After 120 seconds, 57% of the objects are correctly localised by LMK, compared to just 33% by a recent 3D method for egocentric videos and 17% by a general 2D tracking method.

Paper Structure (17 sections, 8 equations, 16 figures)

Figures (16)

  • Figure 1: Lifting 2D observations to 3D. We use mask centroids as 2D object locations and sample the corresponding depths from the mesh-aligned monocular depth estimate. We then compute the 3D object locations in world coordinates by un-projecting each mask's centroid from the estimated camera pose.
  • Figure 2: 3D Projection error. Distribution of Euclidean distance errors for the same object, at one location, comparing $l_n$ to $l_{n+T}$.
  • Figure 3: OSNOM results. PCL of LMK compared to baselines.
  • Figure 4: Effect of visual appearance and location. PCL for visual features (V), location features (L), or both (V+L).
  • Figure 5: Evaluation thresholds. PCL of LMK as the threshold $R$ increases, where $R$ is the maximum distance between predicted and ground truth 3D locations considered successful.
  • ...and 11 more figures
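
The lifting step in the Figure 1 caption and the thresholded evaluation in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a camera-to-world pose given as a rotation matrix and translation, and depth measured along the optical axis. It also reads PCL as the percentage of objects whose predicted 3D location lies within $R$ of the ground truth. The function names `mask_centroid`, `unproject`, and `pcl` are hypothetical.

```python
import math


def mask_centroid(pixels):
    """Mean (u, v) of the pixel coordinates belonging to one object mask."""
    n = len(pixels)
    return (sum(u for u, _ in pixels) / n, sum(v for _, v in pixels) / n)


def unproject(centroid_px, depth, intrinsics, cam_to_world):
    """Lift a 2D pixel with a sampled depth to a 3D point in world coordinates.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    cam_to_world: (R, t), with R a 3x3 rotation (row-major) and t a 3-vector.
    """
    u, v = centroid_px
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel into camera coordinates.
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move from camera coordinates to world coordinates: R @ p_cam + t.
    R, t = cam_to_world
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def pcl(predicted, ground_truth, radius):
    """Percentage of objects predicted within `radius` of their ground truth."""
    hits = sum(
        math.dist(p, g) <= radius for p, g in zip(predicted, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)


# Toy usage: an identity rotation and a camera shifted 1 unit along world x.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
centroid = mask_centroid([(419, 239), (421, 241)])
location = unproject(
    centroid, 2.0, (500.0, 500.0, 320.0, 240.0), (identity, (1.0, 0.0, 0.0))
)
score = pcl([(0, 0, 0), (1, 0, 0)], [(0, 0, 0.1), (2, 0, 0)], radius=0.3)
```

Because the lifted point is expressed in world coordinates, it stays valid after the camera moves away, which is what allows a track to be kept while the object is out of view.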