MV-TAP: Tracking Any Point in Multi-View Videos

Jahyeok Koo; Inès Hyeonsu Kim; Mungyeom Kim; Junghyun Park; Seohyun Park; Jaeyeong Kim; Jung Yi; Seokju Cho; Seungryong Kim

TL;DR

MV-TAP introduces a cross-view attentive framework for tracking points across synchronized multi-view videos by leveraging camera geometry and local 4D correlations. It encodes view geometry with Plücker coordinates, processes tokens through a multi-view spatio-temporal transformer with temporal, spatial, and view attention, and iteratively refines trajectories and occlusion states. A large synthetic Kubric-based dataset and multi-view Harmony4D evaluation demonstrate that MV-TAP outperforms single-view trackers and depth-reliant baselines, particularly under occlusions. The work defines multi-view point tracking in pixel space and provides a principled baseline with strong generalization for future research in robust multi-view tracking.
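To make the Plücker encoding mentioned above concrete, here is a minimal sketch of how a single pixel's viewing ray can be turned into a 6D Plücker vector (d, o × d), where o is the camera centre and d the unit ray direction in world coordinates. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy` and a camera-to-world rotation. The function names, argument layout, and the (direction, moment) ordering are illustrative assumptions and are not taken from the MV-TAP implementation.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pixel_ray_plucker(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Hypothetical helper: Plücker coordinates (d, o x d) of the ray through pixel (u, v).

    Assumes a pinhole camera; R_c2w is a 3x3 camera-to-world rotation
    (nested lists) and origin is the camera centre in world coordinates.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame.
    d_world = tuple(sum(R_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3))
    # Normalise so the direction part has unit length.
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment o x d is independent of which point on the ray is used as o.
    m = cross(origin, d)
    return d + m

# Example: identity rotation, camera centre at (1, 0, 0), pixel at the principal point.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ray = pixel_ray_plucker(320, 240, 500.0, 500.0, 320, 240, identity, (1.0, 0.0, 0.0))
```

Because the moment is always orthogonal to the direction, the six numbers describe the ray itself rather than a particular point on it, which is one reason this parameterisation is a common choice for conditioning networks on camera geometry.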

Abstract

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
