Dense Optical Tracking: Connecting the Dots

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

TL;DR

DOT is shown to be significantly more accurate than current optical flow techniques, outperforms sophisticated “universal” trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster.

Abstract

Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot
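
The nearest-neighbor initialization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `init_dense_flow`, the track layout (a dict with per-frame `pos` and `vis` lists), and the brute-force nearest-neighbor search are all assumptions made for illustration, and the learned refinement stage is left out entirely. Each pixel of the source frame simply copies the displacement and target-frame visibility of the closest track that is visible in the source frame.

```python
from math import inf


def init_dense_flow(tracks, s, t, height, width):
    """Rough dense flow and visibility from sparse tracks (hypothetical helper).

    tracks: list of dicts, each with
        "pos": list of (x, y) positions, one per frame
        "vis": list of booleans, one per frame (True = visible)
    Returns (flow, mask), both indexed as [y][x].
    """
    # Only tracks whose point is visible in the source frame can seed pixels.
    visible = [tr for tr in tracks if tr["vis"][s]]
    if not visible:
        raise ValueError("no track is visible in the source frame")

    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Brute-force search for the closest seed (squared distance).
            best, best_d = None, inf
            for tr in visible:
                px, py = tr["pos"][s]
                d = (px - x) ** 2 + (py - y) ** 2
                if d < best_d:
                    best, best_d = tr, d
            (xs, ys), (xt, yt) = best["pos"][s], best["pos"][t]
            flow[y][x] = (xt - xs, yt - ys)  # displacement of the nearest track
            mask[y][x] = best["vis"][t]      # its visibility in the target frame
    return flow, mask


# Toy example: two usable tracks and one that is hidden in the source frame.
tracks = [
    {"pos": [(0.0, 0.0), (2.0, 1.0)], "vis": [True, True]},
    {"pos": [(3.0, 0.0), (3.0, 0.0)], "vis": [True, False]},
    {"pos": [(1.0, 1.0), (5.0, 5.0)], "vis": [False, True]},  # ignored as a seed
]
flow, mask = init_dense_flow(tracks, s=0, t=1, height=2, width=4)
```

In this toy case the left half of the grid inherits the motion of the first track and the right half inherits the second track, which is static and occluded in the target frame. The per-pixel loop over all tracks is only practical for small inputs; a spatial index such as a k-d tree would be the natural replacement at real video resolutions.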

Paper Structure (34 sections, 1 equation, 13 figures, 4 tables)

Figures (13)

  • Figure 1: DOT unifies point tracking and optical flow techniques. From a few initial tracks, it predicts dense motions and occlusions from source to target frames. We represent tracks in white, occlusions with stripes, and motion directions using distinctive colors.
  • Figure 2: Dense optical tracking. Our approach, DOT, takes a video as input and produces dense motion information between any pair of source and target frames $X_s$ / $X_t$ as an optical flow map $F_{s \rightarrow t}$ and a visibility mask $M_{s \rightarrow t}$. We first track the 2D position and the visibility ($\times$: visible, $\circ$: occluded) of a small set of physical points throughout the video. These are sampled preferably from key regions at motion boundaries (shown in grey). We deduce motion estimates $F^0_{s \rightarrow t}$ / $M^0_{s \rightarrow t}$ by using all the tracks whose associated point is visible at $s$, denoted $V_s$, to initialize their nearest neighbors, forming Voronoi cells. We finally refine these estimates with optical flow techniques, using the frames $X_s$ and $X_t$.
  • Figure 3: Qualitative samples on the CVO benchmark. We show the predicted flow (1st / 3rd rows) and visibility mask (2nd / 4th rows) between the first and last frame of different videos of the Final test set. We also report inference time. Optical flow methods produce smooth motion estimates but miss important object regions. Point tracking methods, PIPS++ and TAPIR, are more accurate but tend to produce noisy estimates. CoTracker improves on this aspect by processing multiple point tracks simultaneously instead of one at a time, but we still observe some artifacts when zooming in. DOT combines the benefits of both optical flow and point tracking approaches.
  • Figure 4: Performance vs. speed on the CVO benchmark. DOT reaches different trade-offs by setting the number $N$ of initial tracks to different values in $\{256, 512, 1024, 2048, 4096, 8192\}$. We observe that our method improves over all approaches while keeping a speed similar to state-of-the-art optical flow techniques.
  • Figure 5: Qualitative samples on the TAP benchmark. We compare various methods by tracking all points in the first frame of videos from the DAVIS dataset. Only foreground points are visualized, each depicted with distinctive colors, and overlaid with white stripes when occluded. We also indicate the time for each method to process a 480p video of 50 frames on an NVIDIA V100 GPU. In the "Hike" video, our method, DOT, stands out by successfully tracking both legs as the person walks. DOT has robust performance under occlusion, as shown in the "Duck" video where the animal changes sides. In contrast, MFT loses sight of the object, showing the limitations of optical flow methods under occlusion. OmniMotion does not account for the rotation of the object. CoTracker successfully tracks the object but fails to predict occlusions, showing the limitations of point tracking methods overly reliant on local features, especially when different parts of an object look similar. DOT handles videos with small objects or atmospheric effects such as smoke, as in the "Drift" video. Other methods tend to miss object parts in similar conditions. Please zoom in for details and refer to the videos in the supplemental materials.
  • ...and 8 more figures