Instance Tracking in 3D Scenes from Egocentric Videos

Yunhan Zhao; Haoyu Ma; Shu Kong; Charless Fowlkes

TL;DR

This work tackles instance tracking in 3D from egocentric videos (IT3DEgo) by introducing a real-world RGB-D benchmark and a two-pronged enrollment protocol: single-view online enrollment (SVOE) and multi-view pre-enrollment (MVPE). It re-purposes existing 2D trackers for 3D lifting and proposes an improved baseline that uses SAM+DINOv2 for robust proposals, augmented with depth/pose information and a Kalman-filter-based motion prior. Experimental results show that leveraging camera pose and depth to operate in world coordinates significantly eases the tracking problem, with 3D guidance enhancing 2D tracking performance and pre-enrollment benefiting from high-quality multi-view templates. The dataset and protocol are poised to drive development of perceptually-aware AR/VR assistive agents while highlighting practical considerations for real-world 3D egocentric tracking.

Abstract

Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol that measures tracking performance in 3D coordinates under two settings for enrolling instances to track: (1) single-view online enrollment, where an instance is specified on-the-fly based on the human wearer's interactions, and (2) multi-view pre-enrollment, where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT), by running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Our experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches in the egocentric setting. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.
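
The 2D-to-3D lifting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera model, a depth value measured along the optical axis at the box center, and a 4x4 camera-to-world pose matrix; none of these conventions is stated on this page, and the function names (`backproject_pixel`, `camera_to_world`, `lift_box_center`) are hypothetical rather than taken from the authors' code.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric z-depth into camera coordinates.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_world(p_cam: Vec3, cam_to_world: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 camera-to-world pose (row-major nested lists) to a point."""
    x, y, z = p_cam
    wx, wy, wz = (
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )
    return (wx, wy, wz)


def lift_box_center(box_xyxy: Sequence[float], depth_m: float,
                    intrinsics: Sequence[float],
                    cam_to_world: Sequence[Sequence[float]]) -> Vec3:
    """Lift the center of a 2D box (x1, y1, x2, y2) to a 3D world point.

    depth_m is the depth sampled at the box center; how to sample it
    robustly (e.g. median over the box) is left to the caller.
    """
    x1, y1, x2, y2 = box_xyxy
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    fx, fy, cx, cy = intrinsics
    p_cam = backproject_pixel(u, v, depth_m, fx, fy, cx, cy)
    return camera_to_world(p_cam, cam_to_world)
```

For example, with intrinsics `(500, 500, 320, 240)`, a box centered on the principal point, a depth of 2 m, and a pose that only translates by `(1, 2, 3)`, the lifted point lies 2 m along the camera's z-axis from that translated origin. If the sensor reports range along the viewing ray rather than z-depth, the back-projection would need to be adjusted accordingly.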

Paper Structure (17 sections, 11 figures, 5 tables)

This paper contains 17 sections, 11 figures, 5 tables.

Introduction
Related Work
IT3DEgo: Protocol and Dataset
Benchmarking Protocol
Dataset
Methodology
Baseline: Re-purposed SOT Trackers
Improved Baseline
Experiments
Benchmark Results
Further Analysis and Ablation Study
Discussion
Conclusion
Additional Dataset Details
Additional Ablation Study
...and 2 more sections

Figures (11)

Figure 1: Motivation for the proposed IT3DEgo benchmark task. We envision the real-world application of an assistive agent that continuously tracks enrolled object instances in 3D and can provide navigation guidance to users to retrieve object instances at any time. Tracked objects are either enrolled online (first row in the library) where objects of interest are identified automatically based on user interactions or pre-enrolled (bottom four rows in the library), where task-relevant objects are modeled from a collection of photos taken from different views. The former setup comes with additional in-context sensor information, such as camera pose and depth while the latter features richer visual information.
Figure 2: Illustration of input and output of our benchmark task. Given a raw RGB-D video sequence with camera poses and object instances of interest, specified either by online enrollment (SVOE) or pre-enrollment (MVPE), the goal of our benchmark task is to output the 3D centers of the object instances in a predefined world coordinate frame at each timestamp. Please see the Benchmarking Protocol section for more details.
Figure 3: Qualitative visualizations of tracking with SVOE in both 3D space (left) and projected 2D view (right). We visualize three top-performing trackers from different categories, i.e., EgoSTARK, VITKT_M, and SAM+DINOv2. For the projected 2D visualization, we compare the projected 3D points of each model with the ground-truth annotated 2D bounding boxes. In the 3D view, we show three concentric circles at each ground-truth position representing the 0.25, 0.5, and 0.75 meter thresholds. In both 2D and 3D visualizations, we find that SAM+DINOv2 outperforms the others, as its predictions are closer to the centers of the object instances.
Figure 4: Performance comparisons of SAM+DINOv2 with different cosine thresholds. By increasing the threshold, we find the model performance first improves and then gradually decreases. Intuitively, increasing the threshold will initially filter noisy predictions but when the threshold is too large the model will miss correct object 3D location updates.
Figure 5: Illustration of our benchmark dataset. It is collected with a HoloLens 2, which captures RGB, depth, and four grayscale side views at 30 fps. The device also captures per-frame camera poses, allowing coarse reconstruction of the surroundings.
...and 6 more figures
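
The cosine-threshold behaviour described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions: each proposal is represented by a feature vector paired with a candidate 3D location, the best-matching proposal updates the tracked location only when its cosine similarity to the enrolled template reaches the threshold, and otherwise the previous location is kept. The matching details of the paper's SAM+DINOv2 baseline are not given on this page, and the names `cosine_similarity` and `update_location` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def update_location(template: Sequence[float],
                    proposals: List[Tuple[Sequence[float], Vec3]],
                    last_location: Optional[Vec3],
                    threshold: float) -> Optional[Vec3]:
    """Return the tracked 3D location after seeing one frame's proposals.

    The best-scoring proposal replaces the location only if its similarity
    is at least `threshold`; otherwise the previous location is kept.
    """
    best_score = -1.0
    best_location: Optional[Vec3] = None
    for feature, location in proposals:
        score = cosine_similarity(template, feature)
        if score > best_score:
            best_score, best_location = score, location
    if best_location is not None and best_score >= threshold:
        return best_location
    return last_location
```

Under this sketch, a low threshold lets noisy matches overwrite the location, while a threshold close to 1 rejects even correct matches so the location is never refreshed, which mirrors the rise-then-fall trend the caption reports.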
