Data-Driven Feature Tracking for Event Cameras With and Without Frames

Nico Messikommer, Carter Fang, Mathias Gehrig, Giovanni Cioffi, Davide Scaramuzza

TL;DR

This work introduces the first data-driven feature tracker for event cameras. It uses the high temporal resolution of events to keep tracking under fast motion, where standard frames suffer from motion blur. A novel frame attention module shares information across all feature tracks. The tracker runs either on events alone or in a hybrid event-frame mode, with the two cameras either aligned or placed side-by-side; the side-by-side stereo setup also supports sparse disparity estimation. Training combines synthetic supervision from Multiflow with pose-based self-supervision to bridge the sim-to-real gap, giving robust performance on the EC and EDS datasets. The results show superior tracking performance and significant runtime advantages over state-of-the-art baselines. The approach also extends to disparity estimation and can be integrated with frame-based trackers in VO/SLAM pipelines.

Abstract

Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in an intensity frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. Our tracker is designed to operate in two distinct configurations: solely with events or in a hybrid mode incorporating both events and frames. The hybrid model offers two setups: an aligned configuration where the event and frame cameras share the same viewpoint, and a hybrid stereo configuration where the event camera and the standard camera are positioned side-by-side. This side-by-side arrangement is particularly valuable as it provides depth information for each feature track, enhancing its utility in applications such as visual odometry and simultaneous localization and mapping.
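As a rough illustration of why a side-by-side arrangement yields depth per feature track, the sketch below converts a disparity into depth with the standard pinhole relation Z = f * B / d. It assumes an idealized rectified stereo pair with a shared focal length; the function name and the numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model).

    Assumption: both cameras are rectified and share the focal length
    focal_px (in pixels); baseline_m is the camera separation in meters.
    Non-positive disparities are mapped to infinite depth.
    """
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full(d.shape, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Hypothetical values: 500 px focal length, 10 cm baseline.
depths = depth_from_disparity([10.0, 25.0, 0.0], focal_px=500.0, baseline_m=0.1)
```

In a real event/frame rig the two sensors generally differ in resolution and intrinsics, so a calibration and rectification step would be needed before this relation applies.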

Paper Structure (30 sections, 6 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: Our method leverages the high temporal resolution of events to provide stable feature tracks in high-speed motion in which standard frames suffer from motion blur. To achieve this, we propose a novel frame attention module that combines the information across feature tracks. Our architecture seamlessly extends to sparse disparity estimation for a dual setup including a standard and event camera.
  • Figure 2: As shown in (a), our event tracker takes as input a reference patch $\mathbf{P_0}$ in a grayscale image $I_0$ and an event patch $\mathbf{P}_j$ constructed from an event stream $E_j$ at timestep $t_j$ and predicts the relative feature displacement $\mathbf{\Delta \hat{f}_{j}}$. Each feature is individually processed by a feature network, which uses a ConvLSTM layer with state $F$ to process a correlation map $C_j$ based on a template feature vector $R_0$ and the pixel-wise feature maps of the event patch. To share information across different feature tracks, our novel frame attention module (b) fuses the processed feature vectors for all tracks in an image using self-attention and a temporal state $S$, which is used to compute the final displacement $\mathbf{\Delta \hat{f}_{j}}$.
  • Figure 3: To adapt our tracker to real event data, our self-supervised loss computes a triangulated point based on the predicted track and the camera poses. The 3D point is then reprojected to each camera plane, and the L1 distance $\ell_j$ between the reprojected and predicted points is used as a supervision signal.
  • Figure 4: For the disparity estimation task, our proposed network takes as input a reference patch $\mathbf{P_0}$ in a grayscale image $I_0$ and a rectangular event patch $\mathbf{P}_j$ constructed from an event stream $E_j$ at timestep $t_0$. Similar to the feature tracking task, our novel frame attention module enables information sharing across different features to compute the final disparity $\mathbf{\hat{d}_{j}}$.
  • Figure 5: Qualitative tracking predictions (blue) and ground truth tracks (green) for the EC dataset (top) and EDS dataset (middle / bottom). Our method predicts more accurate tracks for a higher number of initial features.
  • ...and 8 more figures
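
The pose-based self-supervision described in the Figure 3 caption (triangulate a 3D point from a predicted track and the camera poses, reproject it, and take the L1 distance to the predicted positions) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses linear (DLT) triangulation, which the caption does not specify, and the function names and 3x4 projection-matrix convention are hypothetical choices for illustration rather than the authors' implementation.

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D point from N >= 2 views.

    points_2d: sequence of (u, v) pixel positions, one per view.
    proj_mats: sequence of 3x4 projection matrices K [R | t], one per view.
    Assumption: DLT is used here for simplicity; the source does not say
    which triangulation method the authors use.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_l1(points_2d, proj_mats):
    """Per-view L1 distance between reprojected and predicted positions."""
    X_h = np.append(triangulate_dlt(points_2d, proj_mats), 1.0)
    errors = []
    for p, P in zip(points_2d, proj_mats):
        x = P @ X_h
        reprojected = x[:2] / x[2]
        errors.append(np.abs(reprojected - np.asarray(p, dtype=float)).sum())
    return np.array(errors)
```

A track that is geometrically consistent with the poses reprojects onto itself and gives errors near zero, while a drifting track yields positive errors that can act as a training signal without ground-truth tracks.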