Self-Supervised Any-Point Tracking by Contrastive Random Walks

Ayush Shrivastava, Andrew Owens

TL;DR

A global matching transformer is trained to find cycle consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph.

Abstract

We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and to obtain a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.
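
The cycle-consistency objective described in the abstract can be illustrated with a minimal sketch. It assumes per-point feature vectors for two frames, builds row-stochastic transition matrices from a softmax over feature similarities, chains the forward and backward steps, and penalizes walks that fail to return to their starting point. The function names, the temperature value, and the plain-Python list representation are assumptions made for illustration; they are not taken from the paper, and a real implementation would use a tensor library and the transformer's own matching layer.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(feats_a, feats_b, temperature=0.1):
    """Entry [i][j] is the probability of walking from point i in frame a
    to point j in frame b (all-pairs dot-product similarity, then softmax)."""
    return [
        softmax([sum(x * y for x, y in zip(fa, fb)) for fb in feats_b], temperature)
        for fa in feats_a
    ]


def matmul(a, b):
    """Plain matrix product, used to chain two steps of the walk."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cycle_consistency_loss(feats_t, feats_t1, temperature=0.1):
    """Cross-entropy between the round-trip walk t -> t+1 -> t and the identity.
    The temperature is an assumed hyperparameter, not a value from the paper."""
    forward = transition_matrix(feats_t, feats_t1, temperature)
    backward = transition_matrix(feats_t1, feats_t, temperature)
    round_trip = matmul(forward, backward)
    n = len(round_trip)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n
```

When every point has a distinctive feature that reappears in the next frame, the walk returns to its start and the loss is near zero. When all features are identical, the walk is uniform and the loss is close to the log of the number of points.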

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Global Matching Random Walks. We present a self-supervised method for tracking all physical points over the course of a video, i.e., the Tracking Any Point problem [doersch2022tap]. Our model uses a global matching transformer [xu2022gmflow] to track points cycle-consistently over time, using the contrastive random walk [jabri2020space]. On the TAP-Vid benchmark [doersch2022tap], our approach outperforms self-supervised tracking methods, such as DIFT [tang2023emergent], as well as supervised optical flow methods, such as RAFT [teed2020raft].
  • Figure 2: Model Architecture. Our model takes a pair of images $I_t$ and $I_{t+1}$ as input and computes correspondences between them. We extract visual features with a CNN, add positional encodings, and pass the results as tokens to our global matching transformer. The transformer, which consists of 6 stacked layers of self-attention, cross-attention, and feed-forward networks, processes these features and produces correlated features $F_t$ and $F_{t+1}$. We compute self-attention over $F_t$ and $F_{t+1}$ and use the attention as the transition matrix for performing contrastive random walks. To compute tracks during evaluation, we take an expectation over the affinity matrix to get coordinates $(x, y)$.
  • Figure 3: Label Warping. We propose label warping as a remedy for the shortcut solutions that arise when transformer-based models are used for contrastive random walks. Instead of warping the last feature to match the first feature, we warp the label used for cycle consistency. For an image pair $I_{t}, I_{t+1}$, we apply different affine transformations $T^f$ and $T^b$ to the forward and backward directions of the cycle. We then compute $A_{t}^{t+1}$ and $A_{t+1}^{t}$ and chain them together to get the affinity matrix for cycle consistency. We supervise this matrix with the warped identity matrix $T_f^b(I)$, where $T_f^b$ represents the transformation that goes from $T^f$ to $T^b$.
  • Figure 4: Qualitative results. We show qualitative results for TapVid-DAVIS videos and compare them with DIFT and RAFT. DIFT relies on semantic correspondences and often loses the point of interest when motion occurs in the video. RAFT produces accurate movements for several tracks but suffers from drift when its predictions are chained over a long period. In the first video, our method tracks points accurately over many timesteps. RAFT, on the other hand, loses the locations of 2 query points, latches onto points on the ground, and starts tracking those instead. In the other 2 videos as well, our method works better than RAFT and DIFT. DIFT produces inaccurate tracks that do not capture motion well. RAFT, although accurate most of the time, loses track of points close to the boundary.
  • Figure 5: Optical flow visualization. Although our method is not trained for optical flow prediction, it produces reasonable flow outputs over multiple timesteps. RAFT produces high-quality flow, as it is an optical flow method trained for this objective. DIFT predicts inaccurate, spotty flow, suggesting that it relies on finding semantic correspondences for certain points in the image instead of relying on local motion cues.
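
The evaluation step mentioned in the Figure 2 caption, taking an expectation over the affinity matrix to obtain $(x, y)$, can be sketched as a probability-weighted average of grid positions. The sketch below handles one query point, i.e. one row of the affinity matrix. The function name, the row-major flattening of the feature grid, and the use of grid-cell indices as coordinates are assumptions for illustration and are not details confirmed by the source.

```python
def expected_coordinates(affinity_row, height, width):
    """Return the expected (x, y) location of one query point.

    affinity_row: probabilities over a height x width grid, flattened in
    row-major order (an assumed layout), summing to 1.
    """
    if len(affinity_row) != height * width:
        raise ValueError("affinity_row must have height * width entries")
    # Column index is idx % width, row index is idx // width.
    x = sum(p * (idx % width) for idx, p in enumerate(affinity_row))
    y = sum(p * (idx // width) for idx, p in enumerate(affinity_row))
    return x, y
```

Because the result is an average rather than an argmax, a distribution split between neighboring cells yields a location between them, which is one way a coarse feature grid can still give sub-cell coordinates.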