3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Yash Bhalgat, Vadim Tschernezki, Iro Laina, João F. Henriques, Andrea Vedaldi, Andrew Zisserman

TL;DR

This paper introduces an approach to instance segmentation and tracking in first-person video that uses 3D awareness to cope with rapid camera motion, frequent occlusions, and limited object visibility, and it outperforms state-of-the-art 2D approaches on the EPIC Fields dataset.

Abstract

Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by $7$ points in Association Accuracy (AssA) and $4.5$ points in IDF1 score, while reducing the number of ID switches by $73\%$ to $80\%$ across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.
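The IDF1 score quoted above is commonly defined as the harmonic mean of identity precision and identity recall. The snippet below is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `idf1` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Standard IDF1: 2*IDTP / (2*IDTP + IDFP + IDFN).

    idtp, idfp, idfn are identity-level true positives, false positives,
    and false negatives, counted after the ground-truth and predicted
    tracks have been matched one-to-one over the whole video.
    """
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        # No detections and no ground truth: define the score as 0.0 here.
        return 0.0
    return 2 * idtp / denominator


# Hypothetical counts, for illustration only (not results from the paper).
score = idf1(idtp=8, idfp=2, idfn=2)
```

Because identities are matched globally, a track that switches ID halfway through a video loses roughly half of its true positives, which is why fewer ID switches tends to go along with a higher IDF1.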

Paper Structure (42 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of the proposed method for 3D-aware object tracking in egocentric videos. The method begins by taking image-level segments and object tracks from a pre-trained video object segmentation model, which are then lifted to 3D using per-frame depth estimates and scene geometry. These segments are fused across time with a 3D-aware tracking cost formulation to refine and maintain consistent object identities throughout the video sequence, even when the objects go out of sight.
  • Figure 2: Qualitative comparison between our method and DEVA (Cheng et al., 2023). We show instance segmentations for selected reference objects. Our method maintains consistent tracks despite viewpoint changes and objects going out of view, while DEVA's tracks break. Our approach successfully segments the pot even when in motion.
  • Figure 3: HOTA and Association accuracy (AssA) metrics across different IoU thresholds.
  • Figure 4: Sensitivity analysis of HOTA performance to hyperparameters. Each vertical axis represents a hyperparameter ($\alpha_s,\alpha_l,\alpha_v,\alpha_c$) or the HOTA metric (rightmost axis). Colored lines show individual configurations; their intersections with the vertical axes indicate the parameter values and the resulting HOTA scores.
  • Figure 5: Qualitative results demonstrating the quality of object reconstructions and amodal segmentations obtained using our 3D-aware tracking method. The "Reference RGB" column shows an image in which the referred object is unoccluded. The last four columns show the resulting amodal segmentations of the object as red masks with a red border.
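
The Figure 1 caption describes lifting 2D segments to 3D using per-frame depth. One way to picture that step is to back-project the masked pixels through a pinhole camera model and average them into a centroid. The sketch below assumes exactly that; the function name `lift_mask_to_centroid`, the intrinsics `fx, fy, cx, cy`, and the toy inputs are all hypothetical, and the paper's actual lifting procedure may differ (for example, in how depth is aligned with scene geometry or how camera poses are applied).

```python
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lift_mask_to_centroid(
    mask: Sequence[Sequence[bool]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Point3D]:
    """Back-project masked pixels to camera coordinates and average them.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = z.
    Returns None if the mask has no pixel with valid (positive) depth.
    """
    sum_x = sum_y = sum_z = 0.0
    count = 0
    for v, mask_row in enumerate(mask):
        for u, inside in enumerate(mask_row):
            z = depth[v][u]
            if not inside or z <= 0.0:
                continue  # skip background pixels and invalid depth
            sum_x += (u - cx) * z / fx
            sum_y += (v - cy) * z / fy
            sum_z += z
            count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count, sum_z / count)


# Toy 2x2 frame: every pixel belongs to the object at a depth of 2 units.
toy_mask = [[True, True], [True, True]]
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
centroid = lift_mask_to_centroid(toy_mask, toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

The result is in the camera frame. Comparing centroids across frames would additionally require transforming each one into a shared world frame with the per-frame camera pose, which is omitted here.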