EgoPoints: Advancing Point Tracking for Egocentric Videos

Ahmad Darkhalil, Rhodri Guerrier, Adam W. Harley, Dima Damen

TL;DR

This work introduces EgoPoints, the first dense point-tracking benchmark tailored to egocentric videos, featuring 517 sequences with 4.7K tracks and new metrics to quantify in-view, out-of-view, and re-identification performance. To address the observed deficiencies, the authors propose K-EPIC, a semi-real training pipeline that fuses scene points from EPIC Fields with dynamic-object points from Kubric, generating 11K sequences and 22.1M tracks for robust fine-tuning. Empirical results show that fine-tuning state-of-the-art trackers (notably CoTracker and PIPs++) on K-EPIC improves EgoPoints performance across multiple metrics, while preserving accuracy on traditional third-person benchmarks; however, re-identification remains a central challenge with substantial headroom for improvement. Overall, EgoPoints provides a valuable benchmark and data-generation approach that promotes progress in egocentric dense point tracking, with practical implications for human-robot collaboration and augmented reality.

Abstract

We introduce EgoPoints, a benchmark for point tracking in egocentric videos. We annotate 4.7K challenging tracks in egocentric sequences. Compared to the popular TAP-Vid-DAVIS evaluation benchmark, we include 9x more points that go out-of-view and 59x more points that require re-identification (ReID) after returning to view. To measure the performance of models on these challenging points, we introduce evaluation metrics that specifically monitor tracking performance on points in-view, out-of-view, and points that require re-identification. We then propose a pipeline to create semi-real sequences, with automatic ground truth. We generate 11K such sequences by combining dynamic Kubric objects with scene points from EPIC Fields. When fine-tuning point tracking methods on these sequences and evaluating on our annotated EgoPoints sequences, we improve CoTracker across all metrics, including the tracking accuracy $\delta^\star_{\text{avg}}$ by 2.7 percentage points and accuracy on ReID sequences (ReID$\delta_{\text{avg}}$) by 2.4 points. We also improve $\delta^\star_{\text{avg}}$ and ReID$\delta_{\text{avg}}$ of PIPs++ by 0.3 and 2.8 respectively.
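
To make the accuracy numbers above more concrete, the following is a minimal sketch of a threshold-averaged position accuracy in the style of the TAP-Vid benchmark: the fraction of evaluated points whose prediction lies within 1, 2, 4, 8 and 16 pixels of the ground truth, averaged over the five thresholds. The function name `delta_avg`, its arguments, and the choice of thresholds are assumptions made for illustration; the exact definition of $\delta^\star_{\text{avg}}$ used for EgoPoints is not given in this abstract and may differ (for example in image resolution or in which points are counted).

```python
import math

# Pixel thresholds following the TAP-Vid convention (an assumption here).
THRESHOLDS_PX = (1, 2, 4, 8, 16)


def delta_avg(pred, gt, visible, thresholds=THRESHOLDS_PX):
    """Average, over thresholds, of the fraction of visible points whose
    predicted position is strictly within the threshold of the ground truth.

    pred, gt : sequences of (x, y) positions, one entry per point-frame.
    visible  : sequence of bools; only entries marked True are evaluated.
    """
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        return float("nan")  # nothing to evaluate
    per_threshold = [
        sum(e < t for e in errors) / len(errors) for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)


if __name__ == "__main__":
    gt = [(0.0, 0.0)] * 3
    pred = [(0.0, 0.0), (3.0, 0.0), (100.0, 0.0)]  # errors: 0, 3, 100 px
    print(delta_avg(pred, gt, [True, True, True]))
```

With the toy inputs above, one point passes every threshold, one passes only the 4, 8 and 16 pixel thresholds, and one passes none, so the score is 8/15.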

Paper Structure

This paper contains 9 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Sample sequences from EgoPoints, with dense points in the reference frame (left) tracked through head motion, where both scene points and dynamic object points leave the field of view and return during the sequence. For example, in the first row, the salt bottle is used in a different part of the scene and then returned to the shelf. We show qualitative results of CoTracker [karaev2023cotracker] before and after fine-tuning with our synthetic sequences combining Kubric and EPIC Fields points (K-EPIC). Fine-tuning increases the number of re-identified points.
  • Figure 2: Example of sparsely annotated sequences from the EgoPoints benchmark (annotated on full-resolution 1080p images). The point/pixel radius is expanded for visualisation purposes. The dashed lines represent dynamic object tracks, while solid lines show scene point tracks.
  • Figure 3: Visualisation of three points tracked over three frames classified by the metrics in EgoPoints. IV: in-view, OOV: out-of-view, ReID: Re-identification (in-view after being out-of-view). $\checkmark$: correctly tracked, $\text{x}$: incorrectly tracked.
  • Figure 4: Examples of re-identification failures in state-of-the-art models. Each row represents a particular video. The top sequence is 305 frames long, whilst the bottom sequence is 994 frames long.
  • Figure 5: The pipeline for K-EPIC. This includes projecting 3D points as tracks and filtering them using CoTracker to get scene points (left). Additionally, we sample 3D objects and tracks from TAP-Vid-KUBRIC (top right). These are combined to produce K-EPIC sequences with ground-truth point tracking. The number of sampled points and brightness of the images are decreased for visualisation purposes.
  • ...and 4 more figures
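
The three categories named in the Figure 3 caption (IV, OOV, ReID) can be illustrated with a small sketch that labels each frame of a single track from its per-frame in-view flags. This is an assumption-laden illustration, not the benchmark's evaluation code: the function name `label_frames` is hypothetical, occlusion is not distinguished from leaving the field of view, and every in-view frame after the first departure is labelled ReID, which may not match how the paper counts re-identification.

```python
def label_frames(in_view):
    """Label each frame of one track as "IV", "OOV" or "ReID".

    in_view : sequence of bools, True when the point is inside the frame.
    A frame is "ReID" when the point is in view again after having been
    in view earlier and then out of view (assumed reading of Figure 3).
    """
    labels = []
    seen_in_view = False   # point has been in view at least once
    left_view = False      # point was in view and then went out of view
    for visible in in_view:
        if not visible:
            if seen_in_view:
                left_view = True
            labels.append("OOV")
        elif left_view:
            labels.append("ReID")
        else:
            seen_in_view = True
            labels.append("IV")
    return labels


if __name__ == "__main__":
    # In view, leaves the frame for two frames, then returns.
    print(label_frames([True, True, False, False, True]))
```

A track that starts out of view and only later appears is labelled IV rather than ReID in this sketch, since there is no earlier observation to re-identify.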