TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Jiaxiong Liu; Zhen Tan; Jinpu Zhang; Yi Zhou; Hui Shen; Xieyuanli Chen; Dewen Hu

TL;DR

TAPFormer is a transformer-based framework that fuses frames and events asynchronously and with temporal consistency for robust, high-frequency arbitrary point tracking. It outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold.
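The summary above does not define "average pixel error within threshold". One common reading in point-tracking evaluation is the fraction of visible points whose position error falls below a pixel threshold, averaged over several thresholds. The sketch below illustrates that reading only; the function name, the input layout, and the thresholds (1, 2, 4, 8, 16 pixels) are assumptions for illustration and may differ from the paper's evaluation protocol.

```python
import math


def fraction_within_thresholds(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average, over pixel thresholds, of the fraction of visible points
    whose Euclidean error is below each threshold.

    pred, gt: sequences of (x, y) pixel coordinates, one per tracked point sample.
    visible:  sequence of booleans; occluded samples are skipped.
    thresholds: pixel thresholds (assumed values, not taken from the paper).
    """
    # Euclidean pixel error for every visible sample.
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy), vis in zip(pred, gt, visible)
        if vis
    ]
    if not errors:
        raise ValueError("no visible points to evaluate")

    # Fraction of visible samples within each threshold, then the mean over thresholds.
    per_threshold = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return sum(per_threshold) / len(per_threshold)
```

Under this reading, a higher score means more points land close to their ground-truth positions, and averaging over thresholds rewards both fine and coarse accuracy.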

Abstract

Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset covering diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
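The idea of weighting modalities by reliability can be illustrated with a deliberately simple sketch. This is not the paper's CLWF module, which the abstract describes as an attention-based component. It only shows the general principle that, at each spatial location, the more reliable modality contributes more to the fused feature. The function name and the per-location reliability scores are hypothetical inputs introduced for illustration.

```python
import math


def reliability_weighted_fusion(frame_feat, event_feat, frame_score, event_score):
    """Blend frame and event features per location using softmax weights
    over two scalar reliability scores (hypothetical inputs).

    frame_feat, event_feat:   lists of per-location feature vectors (lists of floats).
    frame_score, event_score: lists of per-location reliability scores.
    Returns a list of fused feature vectors, one per location.
    """
    fused = []
    for f, e, sf, se in zip(frame_feat, event_feat, frame_score, event_score):
        # Two-way softmax; subtracting the max keeps exp() numerically stable.
        m = max(sf, se)
        wf = math.exp(sf - m)
        we = math.exp(se - m)
        total = wf + we
        wf, we = wf / total, we / total
        # Convex combination of the two modality features at this location.
        fused.append([wf * a + we * b for a, b in zip(f, e)])
    return fused
```

With equal scores the result is the plain average of the two features; when the frame score is very low at a location (for example under motion blur), the fused feature there approaches the event feature.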

Paper Structure (28 sections, 11 equations, 15 figures, 10 tables)

Introduction
Related Work
Frame-Based Point Tracking Methods
Event-Based Point Tracking Methods
Multimodal Fusion for Events and Frames
Method
Problem Definition
Transient Asynchronous Fusion
Cross-modal Local Weighted Fusion Module
Dataset Overview
Experiments
Implementation Details
Task 1: TAP
Task 2: Feature Tracking
Ablation Study
...and 13 more sections

Figures (15)

Figure 1: A qualitative comparison of tracking performance on our dataset shows that (a) frame-based tracking suffers from insufficient temporal information, and (b) event-based tracking fails to capture fine spatial details. In contrast, (c) our fusion approach recovers long-term, high-accuracy point trajectories. The rightmost color bar visualizes the per-point tracking error (in pixels).
Figure 2: TAPFormer overview. (a) The overall framework: frames and events are fused by the Transient Asynchronous Fusion mechanism and Cross-Modal Local Weighted Fusion modules to produce high-frequency transient features, which are refined by temporal attention and decoded into multi-scale fusion features. The resulting features, together with the initial query point positions $\textbf{q}$, are fed into a transformer-based optimization module that iteratively predicts tracking trajectories $\textbf{x}$ and occlusion states $v$. $M$ denotes the number of iterations. (b) The fusion network: image and event tokens are integrated by local weighted cross-attention to construct and update transient representations.
Figure 3: Task 1: TAP on InivTAP. Rows show different sequences, and columns correspond to frame-based, event-based, fusion-based (ours), and ground-truth results.
Figure 4: Red boxes show fast-moving vehicles where frame-based tracking drifts; yellow boxes highlight texture-similar regions where event-based tracking fails. Our fusion-based method achieves stable and accurate tracking in both cases.
Figure 5: Task 2: Feature tracking on EDS.
...and 10 more figures
