TAP-Vid: A Benchmark for Tracking Any Point in a Video

Carl Doersch; Ankush Gupta; Larisa Markeeva; Adrià Recasens; Lucas Smaira; Yusuf Aytar; João Carreira; Andrew Zisserman; Yi Yang

TL;DR

This work formalizes Tracking Any Point (TAP) and introduces TAP-Vid, a benchmark combining real-world and synthetic videos to evaluate long-term point tracking on deformable surfaces. It presents a semi-automatic annotation pipeline aided by optical flow, and proposes TAP-Net, an end-to-end cost-volume-based tracker trained on synthetic Kubric data that outperforms existing baselines across TAP-Vid datasets. The paper provides extensive dataset analyses, annotation quality assessments, and a cross-dataset comparison to JHMDB, highlighting the framework’s potential for robust motion understanding in diverse, nonrigid scenes. While offering strong results, it also discusses limitations (e.g., liquids, transparency) and outlines avenues for future improvements in occlusion handling, high-resolution tracking, and broader domain transfer.

Abstract

Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for making inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation has existed until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
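To make the idea of flow-assisted annotation concrete, the sketch below carries one annotated point forward by chaining per-frame optical flow displacements, so that simple motion such as camera shake is followed without further clicks. This is only an illustration under stated assumptions, not the paper's track assist algorithm, whose details are not given in this section. The names `propagate`, `FlowFn` and the toy `shake` flow field are hypothetical.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
# Hypothetical flow lookup (an assumption for this sketch):
# flow(t, x, y) returns the (dx, dy) displacement from frame t to frame t + 1.
FlowFn = Callable[[int, float, float], Tuple[float, float]]


def propagate(start: Point, t_start: int, t_end: int, flow: FlowFn) -> List[Point]:
    """Chain per-frame flow to carry a point from t_start to t_end (inclusive)."""
    x, y = start
    path = [(x, y)]
    for t in range(t_start, t_end):
        dx, dy = flow(t, x, y)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def shake(t: int, x: float, y: float) -> Tuple[float, float]:
    """Toy flow field: the whole image shifts 2 px right on even frames, 2 px left on odd ones."""
    return (2.0, 0.0) if t % 2 == 0 else (-2.0, 0.0)


if __name__ == "__main__":
    # One annotated point at frame 0, propagated through frame 4.
    print(propagate((10.0, 5.0), 0, 4, shake))
```

Chained flow accumulates error over long spans, which is consistent with the abstract's framing: flow handles the easy short-term motion, while annotators supply points on the harder sections of video.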

Paper Structure (38 sections, 3 equations, 14 figures, 12 tables)

Introduction
Related Work
Dataset Overview
TAP-Vid Datasets
Real-World Dataset Construction
Annotation Interface
Track Assist Algorithm
Evaluation and Metrics
Dataset Analysis
Point trajectory statistics
Evaluation of human annotation quality
Baselines
TAP-Net
Cost Volume
Track Prediction
...and 23 more sections

Figures (14)

Figure 1: The problem of tracking any point (TAP) in a video. The input is a video clip (e.g. 10s long) and a set of query points ($x,y,t$ in the pixel/frame coordinates; shown with double circles). The goal is to predict trajectories ($x,y$ pixel coordinates; coloured lines) over the whole video, indicating the same physical point on the same surface, as well as a binary occlusion indicator (black solid segments) indicating frames where it isn't visible.
Figure 2: Correspondence tasks in videos. Most prior work on motion understanding has involved tracking (1) bounding boxes or (2) segments, which lose information about rotation and deformation; (3) optical flow, which analyzes each frame pair in isolation; (4) structure-from-motion inspired physical keypoints, which struggle with deformable objects; or (5) semantic keypoints, which are chosen by hand for every object of interest. Our task, in contrast, is to Track Any Point on physical surfaces, including those on deformable objects, over an entire video.
Figure 3: The TAP-Vid point tracking datasets. Ground-truth point annotations on two random videos from each of the four point tracking datasets we use for evaluation: TAP-Vid-Kinetics and TAP-Vid-DAVIS, which contain real-world videos with point annotations collected from humans; the synthetic TAP-Vid-Kubric dataset; and TAP-Vid-RGB-Stacking, from a simulated robotics environment.
Figure 4: Annotation workflow. There are 3 stages: (1) object selection with bounding boxes, (2) point annotation through optical-flow-based assistance, and (3) iterative refinement and correction.
Figure 5: Point annotation interface and instructions. The interface consists of three components: visualization panel, buttons, and information panels. The instructions consist of six steps which guide annotators to iteratively add points with the help of the track assist algorithm.
...and 9 more figures
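
The task format described in the Figure 1 caption (query points given as $x,y,t$; outputs given as per-frame $x,y$ positions plus a binary occlusion indicator) can be written down as a small data structure. The sketch below is only an illustration of that input/output contract. The class and function names (`QueryPoint`, `Track`, `is_well_formed`, `static_track`) are hypothetical and do not come from the benchmark's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueryPoint:
    x: float  # pixel coordinate
    y: float  # pixel coordinate
    t: int    # index of the frame in which the point is specified


@dataclass
class Track:
    xy: List[Tuple[float, float]]  # one (x, y) position per video frame
    occluded: List[bool]           # True on frames where the point is not visible


def is_well_formed(query: QueryPoint, track: Track, num_frames: int) -> bool:
    """Check that a track covers the whole video and the query frame exists."""
    return (
        0 <= query.t < num_frames
        and len(track.xy) == num_frames
        and len(track.occluded) == num_frames
    )


def static_track(query: QueryPoint, num_frames: int) -> Track:
    """Trivial placeholder prediction: the point never moves and is always visible."""
    return Track(
        xy=[(query.x, query.y)] * num_frames,
        occluded=[False] * num_frames,
    )


if __name__ == "__main__":
    q = QueryPoint(x=12.0, y=34.0, t=3)
    prediction = static_track(q, num_frames=5)
    print(is_well_formed(q, prediction, num_frames=5))
```

The `static_track` function is not a method from the paper; it only shows the shape of a prediction that a real tracker would need to fill in.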
