Local All-Pair Correspondence for Point Tracking

Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, Joon-Young Lee

TL;DR

LocoTrack introduces a local all-pair correspondence framework for point tracking: it computes dense 4D correlations within a restricted region to resolve matching ambiguities and processes them with a lightweight correlation encoder. A length-generalizable Transformer then aggregates temporal information for long-range tracking. The model achieves unmatched accuracy on the TAP-Vid benchmarks while running almost 6 times faster than the prior state of the art, and it is robust in homogeneous and occluded scenes. The design pairs dense correspondence priors with efficient temporal modeling suited to long videos and varied resolutions.

Abstract

We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
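The difference between point-to-region (local 2D) and region-to-region (local 4D) correlation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the integer pixel positions, the zero padding at image borders, and the use of raw dot products as the similarity are all assumptions made for illustration.

```python
import numpy as np


def crop_patch(feat, center, radius):
    """Crop a (2r+1, 2r+1, C) patch around an integer (y, x) center.

    Zero padding at the borders is an assumption of this sketch.
    """
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)))
    y, x = center
    # After padding, original pixel (y, x) sits at (y + radius, x + radius),
    # so the window centered on it starts at (y, x) in padded coordinates.
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]


def local_2d_correlation(query_feat, target_feat, query_pt, target_pt, radius):
    """Point-to-region: one query vector against a target neighborhood."""
    q = query_feat[query_pt[0], query_pt[1]]        # (C,)
    t = crop_patch(target_feat, target_pt, radius)  # (K, K, C)
    return t @ q                                    # (K, K)


def local_4d_correlation(query_feat, target_feat, query_pt, target_pt, radius):
    """Region-to-region: all pairs between two local neighborhoods."""
    q = crop_patch(query_feat, query_pt, radius)    # (K, K, C)
    t = crop_patch(target_feat, target_pt, radius)  # (K, K, C)
    # corr[i, j, k, l] = <q[i, j], t[k, l]>
    return np.einsum("ijc,klc->ijkl", q, t)         # (K, K, K, K)
```

In this sketch the 2D map is exactly the central slice of the 4D volume (`corr4d[radius, radius]`). The remaining slices record how the neighbors of the query point match the target region, which is the extra evidence the abstract credits with reducing ambiguity in homogeneous or repetitive areas.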

Paper Structure (30 sections, 7 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Evaluating LocoTrack against state-of-the-art methods. We compare our LocoTrack against other SOTA methods [karaev2023cotracker, doersch2023tapir] in terms of model size (circle size), accuracy (y-axis), and throughput (x-axis). LocoTrack shows exceptionally high precision and efficiency.
  • Figure 2: Illustration of our core component. Our local all-pair formulation, achieved with local 4D correlation, demonstrates robustness against matching ambiguity. This contrasts with previous works [harley2022particle, doersch2023tapir, karaev2023cotracker, vecerik2023robotap] that rely on point-to-region correspondences, achieved with local 2D correlation, and are therefore susceptible to such ambiguity.
  • Figure 3: Overall architecture of LocoTrack. Our model comprises two stages: track initialization and track refinement. The track initialization stage determines a rough position by conducting feature matching with global correlation. The track refinement stage iteratively refines the track by processing the local 4D correlation.
  • Figure 4: Visualization of correspondence. We visualize the correspondences established between the query and target regions. Our refined 4D correlation (e) demonstrates a clear reduction in matching ambiguity and yields better correspondences compared to the noisy results produced by 2D correlation (d). This improvement aligns closely with the ground truth (c).
  • Figure 5: Local 4D correlation encoder.
  • ...and 4 more figures
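
The track initialization stage described in the Figure 3 caption, a rough position obtained by feature matching with global correlation, can be sketched as follows. This is an illustrative assumption rather than the paper's exact procedure: the function name `init_track` is hypothetical, and a hard argmax stands in for whatever position readout the model actually uses.

```python
import numpy as np


def init_track(query_vec, target_feats):
    """Estimate a rough (y, x) position per frame from global correlation.

    query_vec:    (C,) feature sampled at the query point.
    target_feats: (T, H, W, C) feature maps for T frames.
    Returns integer positions of shape (T, 2) and the correlation maps (T, H, W).
    """
    # Global correlation: the query vector against every location of every frame.
    corr = target_feats @ query_vec                    # (T, H, W)
    num_frames, height, width = corr.shape
    # Hard argmax per frame (an assumption of this sketch).
    flat_idx = corr.reshape(num_frames, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([ys, xs], axis=1), corr
```

The refinement stage would then start from these positions and iteratively update them using the local 4D correlation around each estimate.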