Solution for Point Tracking Task of ICCV 1st Perception Test Challenge 2023

Hongpeng Pan, Yang Yang, Zhongtian Fu, Yuxuan Zhang, Shian Du, Yi Xu, Xiangyang Ji

TL;DR

The paper tackles robust single-point tracking in TAP-Vid under zero-shot conditions, focusing on static-camera scenarios where TAPIR exhibits static-point jitter and drift. It introduces TAPIR+ with two key components: Multi-granularity Camera Motion Detection to classify camera motion and Confident Moving Region (CMR)-based trajectory prediction to stabilize static points, while preserving TAPIR outputs for moving cameras. Through zero-shot experiments on MOVi-F, TAPIR+ achieves substantial improvements over TAPIR, particularly for static-camera videos (AJ increase of about 2.8) and attains state-of-the-art results in the final test (AJ up to $47.19$ static, $45.78$ moving). This approach enhances robustness and generalization for single-point tracking across camera motions, offering a practical, scalable solution for TAP-Vid-like perception tasks.

Abstract

This report proposes an improved method for the Tracking Any Point (TAP) task, which tracks any physical surface through a video. Several existing approaches to TAP exploit temporal relationships to obtain smooth point trajectories; however, they still suffer from the cumulative error caused by temporal prediction. To address this issue, we propose a simple yet effective approach called TAP with confident static points (TAPIR+), which focuses on rectifying the tracking of static points in videos shot by a static camera. Our approach contains two key components: (1) Multi-granularity Camera Motion Detection, which identifies video sequences shot by a static camera; and (2) CMR-based point trajectory prediction, which uses a moving object segmentation approach to isolate static points from moving objects. Our approach ranked first in the final test with a score of 0.46.

Paper Structure (9 sections, 3 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: In (a), points of different shapes represent prediction results at different moments ($t_0$, $t_1$, $t_2$). At label 3 in (b), the blue dashed box illustrates jitter around a stationary point, while the orange dashed box illustrates static-point pseudo-following. These two kinds of positional offset, which arise when the model predicts static points, are significant factors limiting the model's performance.
  • Figure 2: In the framework, the original video is first fed into the TAPIR model to obtain preliminary point trajectory predictions. The SSMI metric is then computed for both the entire video and its fine-grained clips, and this metric is used to decide whether the video was shot with a static camera. If it was, a confident moving region is computed for each frame of the video using a motion detection algorithm, and predicted points that fall outside the region are reset to their initial positions. If it was not, the original model's results are output unchanged.
  • Figure 3: Original Predictions: predictions from the original model. Confident Moving Regions: the confident moving regions predicted by our method. Final Predictions: the final predictions, obtained by combining the original model's output with our approach.
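
The correction step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `diff_threshold` parameter, and the use of simple differencing against the first frame as the motion detector are all hypothetical stand-ins, and the static-camera decision is taken as a given boolean rather than computed from the metric in the caption.

```python
import numpy as np


def moving_region_masks(frames, diff_threshold=0.1):
    """Return (T, H, W) boolean masks of pixels that differ from frame 0.

    Assumption: plain frame differencing stands in for the motion
    detection algorithm; the source does not specify which one is used.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(frames - frames[0]) > diff_threshold


def correct_static_points(tracks, query_points, masks):
    """Reset predictions that fall outside the moving region.

    tracks:       (N, T, 2) predicted (x, y) positions per point and frame
    query_points: (N, 2) initial (x, y) position of each point
    masks:        (T, H, W) boolean moving-region masks
    """
    tracks = np.asarray(tracks, dtype=np.float64)
    query_points = np.asarray(query_points, dtype=np.float64)
    num_frames, height, width = masks.shape

    # Round predictions to pixel indices and clamp them to the image bounds.
    xs = np.clip(np.rint(tracks[..., 0]).astype(int), 0, width - 1)
    ys = np.clip(np.rint(tracks[..., 1]).astype(int), 0, height - 1)

    # inside[n, t] is True when point n lies in the moving region at frame t.
    frame_idx = np.arange(num_frames)[None, :]
    inside = masks[frame_idx, ys, xs]

    # Keep predictions inside the region; reset the rest to the query point.
    return np.where(inside[..., None], tracks, query_points[:, None, :])


def refine_tracks(tracks, query_points, frames, camera_is_static):
    """Apply the correction only when the camera is judged static."""
    if not camera_is_static:
        return np.asarray(tracks, dtype=np.float64)
    masks = moving_region_masks(frames)
    return correct_static_points(tracks, query_points, masks)
```

In this sketch a point is treated as static at a given frame whenever its prediction lands outside the mask, so the quality of the result depends entirely on how well the mask covers the truly moving objects; a real motion detector would likely need to be more robust than first-frame differencing.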