MV-TAP: Tracking Any Point in Multi-View Videos

Jahyeok Koo, Inès Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyun Park, Jaeyeong Kim, Jung Yi, Seokju Cho, Seungryong Kim

TL;DR

MV-TAP introduces a cross-view attentive framework for tracking points across synchronized multi-view videos by leveraging camera geometry and local 4D correlations. It encodes view geometry with Plücker coordinates, processes tokens through a multi-view spatio-temporal transformer with temporal, spatial, and view attention, and iteratively refines trajectories and occlusion states. A large synthetic Kubric-based dataset and multi-view Harmony4D evaluation demonstrate that MV-TAP outperforms single-view trackers and depth-reliant baselines, particularly under occlusions. The work defines multi-view point tracking in pixel space and provides a principled baseline with strong generalization for future research in robust multi-view tracking.
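As a rough illustration of encoding view geometry with Plücker coordinates, the sketch below maps pixels to 6D ray embeddings made of a unit direction and a moment (origin × direction). The function name `plucker_rays`, the world-to-camera convention `x_cam = R @ x_world + t`, and the use of NumPy are assumptions made for this example; the summary does not state how MV-TAP builds its camera embedding.

```python
import numpy as np

def plucker_rays(K, R, t, pixels):
    """Return (N, 6) Pluecker coordinates [direction, moment] for pixel rays.

    Assumed convention (not from the source): x_cam = R @ x_world + t.
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation,
    pixels: (N, 2) pixel coordinates (u, v).
    """
    pixels = np.asarray(pixels, dtype=float)
    origin = -R.T @ t                                   # camera center in world frame
    homog = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    dirs_cam = homog @ np.linalg.inv(K).T               # back-project pixels to camera rays
    dirs_world = dirs_cam @ R                           # row-wise application of R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    moment = np.cross(origin, dirs_world)               # m = o x d, broadcast over rays
    return np.concatenate([dirs_world, moment], axis=1)

# Usage: a camera centered at (1, 0, 0) looking down +z with identity intrinsics.
K = np.eye(3)
R = np.eye(3)
t = np.array([-1.0, 0.0, 0.0])
rays = plucker_rays(K, R, t, [[0.0, 0.0], [0.5, -0.25]])
```

Because the moment depends on the camera center, two cameras observing along the same direction from different positions receive different embeddings, which is what makes this representation useful for conveying relative geometry between views.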

Abstract

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
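To make the idea of aggregating information across time, points, and views more concrete, here is a minimal sketch of attention factored over the axes of a token grid of shape `(V, T, N, C)` (views, frames, query points, channels). Everything in it is an assumption for illustration: the attention is single-head and has no learned weights, and the names `attend_along` and `mv_block` are hypothetical. It is not the authors' architecture, only a demonstration of how attending along one axis at a time mixes information along that axis alone.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Parameter-free single-head self-attention over the second-to-last axis."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def attend_along(tokens, axis):
    """Apply self-attention along one axis of a (V, T, N, C) token grid."""
    moved = np.moveaxis(tokens, axis, -2)      # bring the chosen axis next to channels
    out = self_attention(moved)
    return np.moveaxis(out, -2, axis)          # restore the original layout

def mv_block(tokens):
    """One block: temporal, then spatial (across points), then view attention."""
    tokens = tokens + attend_along(tokens, axis=1)  # temporal: mixes frames
    tokens = tokens + attend_along(tokens, axis=2)  # spatial: mixes query points
    tokens = tokens + attend_along(tokens, axis=0)  # view: mixes cameras
    return tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 3, 4, 8))         # 2 views, 3 frames, 4 points, 8 channels
out = mv_block(tokens)
```

Factoring attention this way keeps each step's cost proportional to the square of a single axis length rather than the square of the full token count; only the view step lets a point that is occluded in one camera draw on tokens from another camera.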

Paper Structure

This paper contains 23 sections, 15 equations, 10 figures, 15 tables, 2 algorithms.

Figures (10)

  • Figure 1: We present MV-TAP (Tracking Any Point in Multi-view Videos), a model designed to effectively integrate information across multiple viewpoints for robust and high-quality point tracking. (a) We visualize the results of MV-TAP on Harmony4D [khirodkar2024harmony4d]. (b) MV-TAP achieves noticeable gains over other baselines [doersch2023tapir, karaev2024cotracker, karaev2024cotracker3, cho2024local, zholus2025tapnext, xiao2024spatialtracker, zhang2025tapip3d], demonstrating its ability to leverage multi-view information.
  • Figure 2: Motivation. We conceptually contrast our (c) multi-view point tracking with (a) single-view point tracking [doersch2023tapir, karaev2024cotracker, karaev2024cotracker3, cho2024local, zholus2025tapnext, xiao2024spatialtracker, zhang2025tapip3d] and (b) multi-view matching [lowe2004distinctive, rublee2011orb, detone2018superpoint, sarlin2020superglue, sun2021loftr]. Our approach simultaneously models both view- and frame-wise interactions to ensure cross-view and temporal consistency.
  • Figure 3: Overall architecture of MV-TAP. Given synchronized multi-view videos, per-view correlation volumes are extracted from CNN encoder features for each query point. These correlations are then tokenized and summed with a camera embedding, which provides relative geometric context across views, and a temporal embedding. Trajectories and occlusion states are iteratively updated by a Transformer architecture comprising temporal, spatial, and view attention modules.
  • Figure 4: Qualitative comparison. We visualize results of MV-TAP and a single-view baseline [karaev2024cotracker3] on the DexYCB [chao2021dexycb] and Panoptic Studio [joo2015panoptic] datasets. The single-view baseline fails under occlusions and large motions, producing highly fragmented tracks, whereas MV-TAP is more robust and maintains consistent trajectories in these challenging scenarios.
  • Figure 5: Visualization of point trajectories obtained by MV-TAP across diverse datasets. We showcase predictions of our model on the DexYCB [chao2021dexycb], Panoptic Studio [joo2015panoptic], and Harmony4D [khirodkar2024harmony4d] datasets.
  • ...and 5 more figures