AllTracker: Efficient Dense Point Tracking at High Resolution

Adam W. Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Suya You, Rares Ambrus, Katerina Fragkiadaki, Leonidas J. Guibas

TL;DR

AllTracker reframes long-range point tracking as dense, multi-frame optical flow between a query frame and all other frames, enabling all-pixel trajectories at high resolution. The method combines a ConvNeXt-based encoder, multi-scale appearance correlation, and iterative refinement with pixel-aligned temporal attention over sliding windows to produce dense flow and visibility estimates. It achieves state-of-the-art performance on dense high-resolution point-tracking benchmarks, while remaining memory- and speed-efficient enough for near real-time inference, and it benefits from joint training on optical flow and point-tracking data. The work highlights the practical value of dense, long-range tracking and provides extensive ablations and strong empirical results, while noting limitations in short-range motion estimation and advocating future work on larger temporal contexts and physics-informed constraints.

Abstract

We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach relates one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40 GB GPU). A benefit of our design is that we can train jointly on optical flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at https://alltracker.github.io
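
To make the "point tracks by way of flow fields" idea concrete, the following is a minimal sketch of how dense query-to-frame flow maps could be read out as point trajectories. It is not taken from the AllTracker code: the function name `sample_trajectories`, the `(T, H, W, 2)` array layout, the `(dx, dy)` channel order, and the restriction to integer query pixels are all assumptions made for illustration.

```python
import numpy as np

def sample_trajectories(flows, queries):
    """Read point trajectories out of dense query-to-frame flow maps.

    flows:   (T, H, W, 2) array; flows[t, y, x] is the assumed (dx, dy)
             displacement of query-frame pixel (x, y) in frame t.
    queries: (N, 2) integer (x, y) pixel coordinates in the query frame.
    Returns: (T, N, 2) array of (x, y) positions, one row per frame.
    """
    flows = np.asarray(flows, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.int64)
    xs, ys = queries[:, 0], queries[:, 1]
    # Look up each query pixel's displacement in every frame: (T, N, 2).
    disp = flows[:, ys, xs, :]
    # A trajectory is the query position plus its per-frame displacement.
    return queries[None, :, :].astype(np.float64) + disp

# Toy example: 3 frames of a 4x5 image with a uniform drift.
flows = np.zeros((3, 4, 5, 2))
flows[1, ..., 0] = 1.0   # frame 1: every pixel moves 1 px right
flows[2, ..., 0] = 2.0   # frame 2: 2 px right ...
flows[2, ..., 1] = 1.0   # ... and 1 px down
tracks = sample_trajectories(flows, [(0, 0), (2, 3)])
print(tracks[:, 1])      # positions of query pixel (2, 3) over time
```

Because every pixel of the query frame has a flow vector for every frame, any subset of pixels can be turned into tracks with a single indexing operation; sub-pixel queries would need interpolation, which this sketch leaves out.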

Paper Structure

This paper contains 45 sections, 2 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: AllTracker estimates high-resolution optical flow between a "query frame" and every other frame of a video, using a sliding-window strategy. Point samples from these outputs can be interpreted as long-term point trajectories.
  • Figure 2: AllTracker architecture. First, we compute feature maps for all frames, copy the zeroth (query) feature map to every timestep, and compute multi-scale cost volumes. Then, we iterate a recurrent module, which references the query feature map and cost volume pyramid at each timestep, and estimates a low-resolution correspondence field using interleaved 2D convolutions and pixel-aligned temporal attention layers. The output of the RNN is upsampled into high-resolution optical flow maps, which relate all pixels of the zeroth frame to every other frame.
  • Figure 3: AllTracker (top right corner) delivers accurate multi-frame tracks at the throughput of an optical flow model.
  • Figure 4: AllTracker produces accurate displacement fields across dozens of frames. Prior optical flow methods struggle to make correspondences across wide time gaps, whereas our model uses temporal priors to resolve the ambiguity. Prior point trackers take multiple minutes to produce output at this density and show splotchy error patterns, whereas our method produces coherent output in less than a second.
  • Figure 5: Detailed view of the iterative refinement block. We consolidate data from visibility, confidence, correlation, motion, and appearance features into a single feature map, then interleave convolutional spatial blocks and pixel-aligned temporal blocks, and output revisions to the features, visibility, confidence, and motion. This refinement process is iterated 4 times (with shared weights).
  • ...and 2 more figures
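
The captions for Figures 2 and 5 mention "pixel-aligned temporal attention" without spelling out the operation. One plausible reading, sketched below, is that each spatial location attends only across time to the features at the same location in other frames, so no information moves spatially in this step. This is an illustrative single-head version under that assumption, not the authors' implementation: the function name, the projection matrices `w_q`, `w_k`, `w_v`, and the `(T, H, W, C)` layout are all hypothetical.

```python
import numpy as np

def pixel_aligned_temporal_attention(feats, w_q, w_k, w_v):
    """Single-head attention over time, applied independently per pixel.

    feats:         (T, H, W, C) feature maps, one per timestep.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters).
    Returns:       (T, H, W, D) features mixed along the time axis only.
    """
    T, H, W, C = feats.shape
    # Treat every pixel as its own length-T sequence: (H*W, T, C).
    seq = feats.reshape(T, H * W, C).transpose(1, 0, 2)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    # Scaled dot-product scores between timesteps of the same pixel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H*W, T, T)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over time
    out = weights @ v  # (H*W, T, D)
    return out.transpose(1, 0, 2).reshape(T, H, W, -1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3, 3, 8))
eye = np.eye(8)  # identity projections keep the toy example simple
out = pixel_aligned_temporal_attention(feats, eye, eye, eye)
print(out.shape)
```

In this reading, spatial context would have to come from the interleaved 2D convolution blocks, while the attention step only shares information between timesteps of the same pixel. Its cost grows with the square of the window length per pixel rather than with the square of the number of pixels.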