TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu

TL;DR

TAPFormer is a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust, high-frequency arbitrary point tracking. It outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold.

Abstract

Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
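
The abstract reports its headline result as "average pixel error within threshold" without defining it. As a rough illustration, the sketch below computes the position-accuracy score commonly used on TAP benchmarks: the fraction of visible points whose prediction falls within a pixel threshold of the ground truth, averaged over several thresholds. The threshold set (1, 2, 4, 8, 16 pixels), the function names, and the toy data are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
import math


def fraction_within(pred, gt, visible, threshold):
    """Fraction of visible points predicted within `threshold` pixels of ground truth."""
    hits = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # occluded points are excluded from the position score
        total += 1
        if math.hypot(px - gx, py - gy) < threshold:
            hits += 1
    return hits / total if total else 0.0


def delta_avg(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average of the within-threshold fractions over all thresholds (assumed set)."""
    scores = [fraction_within(pred, gt, visible, t) for t in thresholds]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two visible points with errors of 0.5 px and 3 px,
# plus one occluded point that is ignored.
pred = [(10.0, 10.0), (20.0, 23.0), (50.0, 50.0)]
gt = [(10.5, 10.0), (20.0, 20.0), (0.0, 0.0)]
visible = [True, True, False]

score = delta_avg(pred, gt, visible)
print(f"delta_avg = {score:.2f}")
```

Averaging over several thresholds rewards both coarse and fine localization, so one number summarizes accuracy across scales.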

Paper Structure (28 sections, 11 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: A qualitative comparison of tracking performance in our dataset shows that (a) frame-based tracking suffers from insufficient temporal information, and (b) event-based tracking fails to capture fine spatial details. In contrast, our (c) fusion approach can recover long-term, high-accuracy point trajectories. The rightmost color bar visualizes the per-point tracking error (in pixels).
  • Figure 2: TAPFormer overview. (a) The overall framework: frames and events are fused by the Transient Asynchronous Fusion (TAF) mechanism and Cross-modal Locally Weighted Fusion (CLWF) modules to produce high-frequency transient features, which are refined by temporal attention and decoded into multi-scale fusion features. The resulting features, together with the initial query point positions $\mathbf{q}$, are fed into a transformer-based optimization module that iteratively predicts tracking trajectories $\mathbf{x}$ and occlusion states $v$. $M$ denotes the number of iterations. (b) The fusion network: image and event tokens are integrated by local weighted cross-attention to construct and update transient representations.
  • Figure 3: Task 1: TAP on InivTAP. Rows show different sequences, and columns correspond to frame-based, event-based, fusion-based (ours), and ground-truth results.
  • Figure 4: Red boxes show fast-moving vehicles where frame-based tracking drifts; yellow boxes highlight texture-similar regions where event-based tracking fails. Our fusion-based method achieves stable and accurate tracking in both cases.
  • Figure 5: Task 2: Feature tracking on EDS.
  • ...and 10 more figures
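
The abstract and the Figure 2 caption describe fusion that is weighted by modality reliability. The sketch below illustrates only that general idea at a single spatial location: two reliability scores are turned into weights with a softmax, and the frame and event feature vectors are blended accordingly. It is a deliberately simplified stand-in, not the paper's CLWF module, which uses local weighted cross-attention. The function names, the scalar reliability scores, and the toy features are all hypothetical.

```python
import math


def softmax2(a, b):
    """Numerically stable softmax over two scalar scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    s = ea + eb
    return ea / s, eb / s


def fuse_location(frame_feat, event_feat, frame_score, event_score):
    """Blend two feature vectors using weights derived from reliability scores.

    A higher score means the modality is trusted more at this location
    (for example, events under motion blur, frames in static regions).
    """
    w_f, w_e = softmax2(frame_score, event_score)
    return [w_f * f + w_e * e for f, e in zip(frame_feat, event_feat)]


# Toy features (hypothetical) for one location.
frame_feat = [1.0, 0.0]
event_feat = [0.0, 1.0]

# Equal reliability gives the plain average of the two features.
print(fuse_location(frame_feat, event_feat, 0.0, 0.0))

# A blurred frame (low frame score) lets the event feature dominate.
print(fuse_location(frame_feat, event_feat, -50.0, 50.0))
```

Because the weights sum to one, the fused feature stays in the range spanned by the two inputs, and a failing modality is smoothly down-weighted rather than abruptly dropped.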