Tracking Any Point with Frame-Event Fusion Network at High Frame Rate

Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu

TL;DR

FE-TAP is an image-event fusion point tracker that combines the contextual information from image frames with the high temporal resolution of events to achieve robust point tracking at high frame rates under various challenging conditions.

Abstract

Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point's trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Comparison of tracking performance in high-speed motion scenarios: our method (top right), which integrates image and event data, vs. data-driven methods (top left), which rely on the first image frame and event data.
  • Figure 2: Overview of FE-TAP. The EvoFusion module fuses image and event data with different frame rates using an appropriate data selection strategy. The query preparation module computes cost volumes based on the fused feature maps. The iterative update module takes these elements as input and optimizes all point query trajectories in parallel within a sliding window, producing high-frequency point tracks.
  • Figure 3: Qualitative tracking predictions (red) and ground truth tracks (green) for the EC dataset (1st and 2nd columns) and the EDS dataset (3rd and 4th columns). We discard predicted trajectories if they deviate significantly from the ground truth trajectory.
  • Figure 4: Comparison of our method and the data-driven tracker (Messikommer et al., CVPR 2023) under occlusions.
  • Figure 5: (a) Custom-designed image-event synchronization device. We validated the performance of our tracker in real-world driving scenarios, including (b) urban roads and (c) tunnel environments.
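
The Figure 2 caption says that the query preparation module computes cost volumes from the fused feature maps, but this page does not say how. As a rough illustration only, the sketch below assumes the common choice of dot-product correlation between a query point's feature vector and a small neighbourhood of the feature map. The function names (`local_cost_volume`, `best_offset`), the zero padding, and the toy data are hypothetical and are not taken from FE-TAP. Plain Python lists stand in for tensors so the example stays self-contained.

```python
def local_cost_volume(feature_map, query_feature, center, radius):
    """Correlate one query feature with a (2r+1) x (2r+1) neighbourhood.

    feature_map: H x W x C nested lists (stand-in for a fused feature map).
    query_feature: length-C list describing the tracked point.
    center: (row, col) current position estimate in integer pixels.
    Cells outside the map get a cost of 0.0 (assumed zero padding).
    """
    height, width = len(feature_map), len(feature_map[0])
    row0, col0 = center
    volume = []
    for dr in range(-radius, radius + 1):
        row_costs = []
        for dc in range(-radius, radius + 1):
            r, c = row0 + dr, col0 + dc
            if 0 <= r < height and 0 <= c < width:
                # Dot product between the query and the local feature.
                cost = sum(q * f for q, f in zip(query_feature, feature_map[r][c]))
            else:
                cost = 0.0
            row_costs.append(cost)
        volume.append(row_costs)
    return volume


def best_offset(volume, radius):
    """Return the (d_row, d_col) offset with the highest correlation."""
    best = max(
        ((cost, i - radius, j - radius)
         for i, row in enumerate(volume)
         for j, cost in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]


# Toy 4 x 4 map with 2 channels; the point's feature [1, 0] sits at (2, 3).
feature_map = [[[0.0, 1.0] for _ in range(4)] for _ in range(4)]
feature_map[2][3] = [1.0, 0.0]

volume = local_cost_volume(feature_map, [1.0, 0.0], center=(2, 2), radius=1)
offset = best_offset(volume, radius=1)
```

In this toy case the correlation peaks one pixel to the right of the current estimate. A learned tracker would typically pass the whole cost volume to its update module rather than take a hard argmax, and would sample it at sub-pixel positions with a tensor library.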