CoWTracker: Tracking by Warping instead of Correlation

Zihang Lai, Eldar Insafutdinov, Edgar Sucar, Andrea Vedaldi

TL;DR

CoWTracker tackles dense point tracking by removing cost volumes and adopting a warp-based refinement paradigm. It iteratively warps target-frame features to the query frame and uses a spatio-temporal transformer to jointly reason over all tracks, producing high-resolution dense trajectories with linear scaling in resolution and iterations. The approach achieves state-of-the-art results on TAP-Vid and RoboTAP, while also delivering competitive zero-shot optical-flow performance on Sintel, KITTI, and Spring, highlighting a promising unification of tracking and optical flow. Practically, the method enables high-detail tracking at higher resolutions with modest computational overhead, suggesting warp-centric architectures as a viable path for scalable dense matching and motion estimation.
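To make the scaling claim concrete, the following back-of-envelope sketch counts how many values an all-pairs cost volume holds for one frame pair, compared with a single warped feature map. The all-pairs formulation, the feature-map sizes, and the channel count of 128 are illustrative assumptions, not figures from the paper.

```python
def cost_volume_entries(height: int, width: int) -> int:
    """Entries in an all-pairs cost volume between two feature maps (assumed form)."""
    n = height * width
    return n * n  # every location correlated with every other location


def warped_feature_entries(height: int, width: int, channels: int) -> int:
    """Entries in one warped feature map: one feature vector per location."""
    return height * width * channels


if __name__ == "__main__":
    channels = 128  # hypothetical channel count
    for h, w in [(48, 64), (96, 128)]:  # hypothetical feature-map sizes
        print(h, w, cost_volume_entries(h, w), warped_feature_entries(h, w, channels))
```

Doubling each side of the feature map multiplies the number of locations by 4. Under these assumptions the cost volume grows by a factor of 16, while the warped feature map grows by a factor of 4.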

Abstract

Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose CoWTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and RoboTAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.
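The warping step described in the abstract can be illustrated with a minimal sketch, assuming backward warping with bilinear interpolation and border clamping. The function names `bilinear_sample` and `warp_to_query` are hypothetical, scalar features stand in for feature vectors, and none of this is taken from the authors' implementation.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D grid feat[row][col] at real-valued (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the cell
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_to_query(target_feat, track):
    """Backward-warp target-frame features onto the query-frame grid.

    track[row][col] is the current (x, y) estimate, in the target frame,
    of the point located at (col, row) in the query frame.
    """
    return [
        [bilinear_sample(target_feat, x, y) for (x, y) in row]
        for row in track
    ]


if __name__ == "__main__":
    query = [[0, 1, 2, 3], [4, 5, 6, 7]]
    # Target frame: the query content shifted one pixel to the right.
    target = [[9, 0, 1, 2], [9, 4, 5, 6]]
    # A correct track estimate points one pixel to the right of each query pixel.
    track = [[(c + 1.0, float(r)) for c in range(4)] for r in range(2)]
    warped = warp_to_query(target, track)
    print(warped)
```

When the track estimate is correct, the warped target features line up with the query features wherever the track stays inside the frame (the first three columns in this toy case). Any remaining mismatch is the signal a refinement step can use to correct the estimate.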

Paper Structure (40 sections, 6 equations, 12 figures, 5 tables)

This paper contains 40 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: CoWTracker results. Our tracker produces dense, long-range point tracks in diverse real-world scenes. It reliably follows humans undergoing rapid motion (a,b,c,e), animals under repeated occlusions and background clutter (d), and vehicles in challenging outdoor settings (b,f). Corner images show the query points from frame 0. Results are subsampled by a factor of 8 (showing only 1/64 of the predicted points).
  • Figure 2: Dense point tracking on a challenging roller-coaster scene. (a) Initial grid of query points (8$\times$ subsampled). (b) AllTracker (Harley et al., 2025) struggles with the thin structures, large viewpoint changes, and occlusions, and fails to track the front half of the coaster (see zoom-in). (c) CoWTracker accurately follows the front segment and maintains accurate tracks along the coaster tracks, even near boundaries and through occlusions.
  • Figure 3: Left: CoWTracker Pipeline. The backbone extracts video features from the input video, and a lightweight update operator (see right for details) iteratively warps and refines tracks to yield dense trajectories, visibility, and confidence. Right: Update operator. Warped query/target features, hidden states, and current track estimates are fused by a spatial-temporal transformer to predict residual motion and update hidden states.
  • Figure 4: Tracking through a challenging BMX sequence with a full occlusion in the middle frames. Rows compare DELTA, AllTracker, and Ours. Our method maintains a consistent track before, during, and after occlusion, whereas DELTA loses the target and fails to recover and AllTracker exhibits noticeable drift and fragmentation. Our warp-based indexing queries features at high resolution, preserving fine details and enabling accurate localization after occlusion. Numbers in lower-right boxes indicate frame numbers.
  • Figure 5: Optical flow predictions on MPI Sintel using the same model as our point-tracking results. The predicted flows closely match the ground truth even in difficult scenarios—large motion, occlusions, and background clutter. Note that the model was not trained on any optical-flow datasets, including Sintel.
  • ...and 7 more figures
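
Figure 5 notes that the optical-flow predictions come from the same model as the point-tracking results. The captions do not spell out the conversion. One natural reading, assumed in the sketch below, is that flow is the displacement of each query-frame pixel to its tracked position in the next frame. The function name `flow_from_tracks` and the data layout are hypothetical.

```python
def flow_from_tracks(tracks_next):
    """Convert dense tracks at the next frame into an optical-flow field.

    tracks_next[row][col] is the predicted (x, y) position, in the next frame,
    of the query-frame pixel at (col, row). Flow is the per-pixel displacement.
    """
    return [
        [(x - col, y - row) for col, (x, y) in enumerate(row_tracks)]
        for row, row_tracks in enumerate(tracks_next)
    ]


if __name__ == "__main__":
    # Every query pixel moves 1.5 pixels to the right and stays on its row.
    tracks_next = [
        [(1.5, 0.0), (2.5, 0.0)],
        [(1.5, 1.0), (2.5, 1.0)],
    ]
    print(flow_from_tracks(tracks_next))
```

Under this assumption a dense tracker yields a flow field for any frame pair without a separate flow head, which is one way to read the unification of tracking and optical flow that the abstract suggests.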