Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

Bishoy Galoaa, Xiangyu Bai, Shayda Moezzi, Utsav Nandi, Sai Siddhartha Vivek Dhir Rangoju, Somaieh Amraee, Sarah Ostadabbas

TL;DR

This work introduces LAPA, a novel end-to-end transformer-based framework for multi-camera point tracking that unifies detection, cross-view correspondence, and temporal tracking. LAPA uses a distance-based volumetric attention mechanism that integrates epipolar geometry with SfM priors to form soft, differentiable cross-view associations, and it reconstructs 3D trajectories via neural triangulation guided by track queries. The method achieves state-of-the-art performance on the multi-camera dataset extensions TAPVid-3D-MC and PointOdyssey-MC, with substantial improvements in occlusion handling and temporal consistency while maintaining real-time throughput. The results demonstrate the viability of directly modeling 3D point trajectories across camera networks, enabling robust applications in robotics, sports analytics, and markerless motion capture. Calibration robustness and a scalable multi-camera extension protocol further underscore LAPA’s practicality for real-world deployments.

Abstract

This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at https://github.com/ostadabbas/Look-Around-and-Pay-Attention-LAPA-

Paper Structure

This paper contains 33 sections, 16 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Detailed LAPA Architecture: Shows the complete data flow through our five-stage pipeline with volumetric attention grid visualization.
  • Figure 2: LAPA Architecture Overview: Our end-to-end pipeline processes synchronized multi-view frames through (a) 2D point tracking and feature extraction using Co-Tracker and ViT, (b) cross-view correspondence via volumetric grid creation, (c) distance-based geometric attention, (d) 3D reconstruction by triangulation using track query correspondence and compound feature integration, and (e) final 3D trajectories with consistent point identities across time and views.
  • Figure 2: Camera setup for multi-view tracking with three cameras (red) positioned around the world origin (black X).
  • Figure 3: Distance-based geometric attention mechanism. We compute attention weights directly from spatial distances $d(G_{v_a}[i], P_{v_a,j})$ rather than feature similarities, using $A_{v_a}(i,j) = \text{softmax}(-d^2/T)$ to establish correspondences between projected grid points and detected 2D points.
  • Figure 4: Multi-camera point tracking results demonstrating LAPA's robustness to occlusions and limited camera coverage. We show tracking results on four challenging sequences from TAPVid-3D-MC (Boxes, Juggle, Football, Basketball). Each row presents three synchronized camera views with their corresponding 3D trajectory reconstruction (rightmost). LAPA maintains consistent point identities across all views (shown by consistent colors) by leveraging volumetric attention to aggregate information from all available cameras. The 3D visualizations demonstrate smooth, complete trajectories even when points are occluded or outside individual camera fields of view. This is particularly evident in Basketball, where players move between camera frustums, and in Boxes, where the moving object creates occlusions. This multi-view aggregation enables continuous tracking that would be impossible from any single viewpoint.
  • ...and 1 more figure
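
The distance-based attention in the Figure 3 caption, $A_{v_a}(i,j) = \text{softmax}(-d^2/T)$, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that $d$ is the Euclidean pixel distance between a projected grid point and a detected 2D point, and that the softmax normalizes over the detected points $j$ for each grid point $i$; the caption specifies neither. The function name `distance_attention` and the sample coordinates are hypothetical.

```python
import numpy as np

def distance_attention(grid_2d, points_2d, temperature=1.0):
    """Attention weights softmax(-d^2 / T) for a single view.

    grid_2d:   (G, 2) projected grid points (assumed pixel coordinates)
    points_2d: (P, 2) detected 2D points in the same view
    returns:   (G, P) matrix whose rows each sum to 1

    Assumption: the softmax runs over detected points j for each
    grid point i; the caption does not state the normalization axis.
    """
    # Pairwise differences via broadcasting: (G, 1, 2) - (1, P, 2) -> (G, P, 2)
    diff = grid_2d[:, None, :] - points_2d[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)           # squared Euclidean distance d^2
    logits = -sq_dist / temperature                # closer points get larger logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical data: two projected grid points, three detected points.
grid = np.array([[0.0, 0.0], [10.0, 10.0]])
points = np.array([[0.5, 0.0], [9.0, 10.0], [50.0, 50.0]])
A = distance_attention(grid, points, temperature=4.0)
```

In this sketch, each grid point places most of its weight on the nearest detected point, and the distant third point receives almost none. A larger temperature $T$ flattens the distribution and yields softer correspondences.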