Ego3DT: Tracking Every 3D Object in Ego-centric Videos

Shengyu Hao, Wenhao Chai, Zhonghan Zhao, Meiqi Sun, Wendi Hu, Jieyang Zhou, Yixian Zhao, Qi Li, Yizhou Wang, Xi Li, Gaoang Wang

TL;DR

This paper introduces a zero-shot approach for 3D reconstruction and tracking of all objects in ego-centric video. It presents Ego3DT, a framework that first extracts detection and segmentation information for the objects in the ego environment.

Abstract

The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge in this area is the accurate localization and tracking of objects in ego-centric videos, primarily because viewing angles vary substantially. To address this issue, this paper introduces a zero-shot approach for 3D reconstruction and tracking of all objects in ego-centric video. We present Ego3DT, a framework that first identifies and extracts detection and segmentation information for objects within the ego environment. Using information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view with a pre-trained 3D scene reconstruction model. We also introduce a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. Extensive experiments on two newly compiled datasets corroborate the efficacy of our approach, improving HOTA by a factor of 1.04x to 2.90x and showing the robustness and accuracy of our method in diverse ego-centric scenarios.
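
For readers unfamiliar with the metric quoted above, HOTA (Higher Order Tracking Accuracy) is commonly defined as the geometric mean of a detection score and an association score, averaged over localization thresholds. The block below gives that standard definition from the tracking literature. It is not taken from this page, and the symbols ($\alpha$, $TP$, $TPA$, and so on) are introduced here for illustration; how the paper applies the metric to its datasets is not stated in this summary.

```latex
\begin{align}
\mathrm{HOTA}_\alpha &= \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha}, \\
\mathrm{DetA}_\alpha &= \frac{|TP|}{|TP| + |FN| + |FP|}, \\
\mathrm{AssA}_\alpha &= \frac{1}{|TP|} \sum_{c \in TP}
    \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}, \\
\mathrm{HOTA} &= \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \dots,\, 0.95\}} \mathrm{HOTA}_\alpha .
\end{align}
```

Here $TP$, $FN$, and $FP$ count matched, missed, and spurious detections at threshold $\alpha$, while $TPA(c)$, $FNA(c)$, and $FPA(c)$ count, for each matched detection $c$, how consistently its predicted and ground-truth trajectories agree. A tracker that detects objects well but switches IDs often is penalized through the association term.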

Paper Structure

This paper contains 34 sections, 6 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: An illustrative example of Ego3DT. It showcases robust 3D object tracking across ego-centric video frames (from Frame 1 to Frame 5). The 3D field maintains consistent object information, ensuring the tracking ID remains unchanged. This delivers reliable tracking results in dynamic video scenarios, as shown by the persistent tracking of ID 1 and ID 2 across different viewpoints.
  • Figure 2: Ego3DT framework. (1) 2D Detection & Segmentation: Ego-centric video frames undergo object detection and segmentation, using an open-vocabulary (OV) detector to identify objects and SAM to segment object points. (2) Window-level 3D Field: The encoder-decoder structure processes the segmented frames to construct a window-level 3D field. (3) Cross-window Matching and Projection: Subsequent windows are aligned using rotational transforms to maintain object consistency across frames. (4) Global 3D Field: The cumulative data from all windows is integrated to form a global 3D field, with each object assigned a unique ID, facilitating precise object tracking throughout the video sequence.
  • Figure 3: Qualitative results of the 3D tracking field in Ego3DT: a) For the Ego3DT-daily dataset, diverse outdoor objects (IDs 1-7) are successfully tracked within the environment, showing the model's capability to handle varying object types and outdoor conditions. b) In the Ego3DT-indoor dataset, common indoor objects (IDs 1-4) are tracked with high fidelity in a typical room setup, demonstrating the precision of the 3D tracking across different indoor scenes.
  • Figure 4: Qualitative results of 2D tracking comparison: a) Ground Truth sequence showing accurate object detection and consistent ID assignment over time. b) ByteTrack with GLEE detection, which tracks and identifies objects but shows occasional ID inconsistencies and missed detections. c) Our Ego3DT approach, which maintains stable object identification, accurately captures dynamic objects, and excels in consistent ID assignment, especially in motion-rich ego-centric views. For each method, frames are ordered in time from left to right.
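
The idea behind steps (3) and (4) of Figure 2, keeping one ID per object as new windows are merged into a global 3D field, can be illustrated with a small sketch. This is not the paper's dynamic hierarchical association mechanism, whose details are not given here. It is a plain greedy nearest-centroid matcher, and it assumes object centroids from each window have already been transformed into a shared global frame. The function name `associate`, the `max_dist` threshold, and the centroid representation are all hypothetical.

```python
import math


def associate(tracks, detections, max_dist=0.5):
    """Greedily match new 3D centroids to existing tracks (illustrative only).

    tracks:     dict mapping integer track ID -> (x, y, z) last known centroid
    detections: list of (x, y, z) centroids, assumed to be in the same frame
    max_dist:   hypothetical gating distance; pairs farther apart never match

    Returns one track ID per detection and updates `tracks` in place.
    """
    # Every (track, detection) pair, closest first.
    pairs = sorted(
        (math.dist(centroid, det), tid, j)
        for tid, centroid in tracks.items()
        for j, det in enumerate(detections)
    )

    assigned = {}        # detection index -> track ID
    used_tracks = set()  # tracks already claimed by a closer detection
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if tid in used_tracks or j in assigned:
            continue
        assigned[j] = tid
        used_tracks.add(tid)

    # Unmatched detections start new tracks with fresh IDs.
    next_id = max(tracks, default=0) + 1
    ids = []
    for j, det in enumerate(detections):
        tid = assigned.get(j)
        if tid is None:
            tid = next_id
            next_id += 1
        tracks[tid] = det  # store the latest position for this track
        ids.append(tid)
    return ids


if __name__ == "__main__":
    tracks = {1: (0.0, 0.0, 0.0), 2: (2.0, 0.0, 0.0)}
    new_window = [(2.1, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
    print(associate(tracks, new_window))
```

Because matching happens in 3D rather than in image space, an object that leaves the field of view and reappears from a different angle can still fall near its stored centroid and keep its ID, which is the behavior Figure 1 describes for ID 1 and ID 2. A greedy matcher is used here only for brevity; an optimal assignment method could be swapped in.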