SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM

Yannick Burkhardt, Simon Schaefer, Stefan Leutenegger

TL;DR

SuperEvent addresses the challenge of learning stable, descriptive keypoints for event cameras by using a frame-based detector and matcher to generate pseudo-labels on synchronized grayscale frames, which enables supervision directly on real event data. It introduces Multi-Channel Time Surfaces (MCTS) to capture motion across multiple time scales and a transformer-based backbone to extract features, coupled with a loss that jointly optimizes detection and descriptor matching. Empirically, it achieves state-of-the-art pose estimation on event datasets and significantly improves performance, including loop-closure reliability, when integrated into a VI-SLAM system originally built for frame cameras (OKVIS2). The work demonstrates the practical viability of event-based SLAM by harnessing existing frame-based detectors and validating across diverse datasets, offering a scalable path toward HDR-robust, high-speed perception systems.

Abstract

Event-based keypoint detection and matching hold significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground-truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state of the art in event-based SLAM by a wide margin. Source code is available at https://ethz-mrl.github.io/SuperEvent/.

Paper Structure

This paper contains 26 sections, 6 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Detections on the sequence rec1499023756 of the DDD20 [Hu20] dataset (not used for training). Top: Pseudo-labels from SuperPoint [Det18] and SuperGlue [Sar20] on the gray-scale frames. Bottom: Matched keypoints from SuperEvent in the event stream at the corresponding time stamps.
  • Figure 2: Data processing pipeline of SuperEvent: For training, pseudo-labels are generated using a frame-based detector and matcher on the gray-scale frame pairs. We generate spatially synchronized MCTS at the same timestamps and feed them to SuperEvent. The network predictions are compared to the pseudo-labels and the network weights are optimized using backpropagation. During inference, the network predictions can be used to detect keypoints and match their descriptors on the event stream only.
  • Figure 3: Schematic depiction of the MCTS generation process.
  • Figure 4: SuperEvent network architecture and tensor dimensions: a shared transformer backbone is combined with a detector and a descriptor head. The components of the Convolution-Attention blocks (Conv. + Attn.) and the VGG blocks are displayed on the bottom left. Activations are omitted for simplicity.
  • Figure 5: Examples of the training data for temporal matching. Top (orange): pseudo-labels generated by SuperPoint [Det18] + SuperGlue [Sar20]; bottom (green): predictions of SuperEvent after training.
  • ...and 6 more figures
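
The captions of Figures 2 and 3 refer to MCTS generation, but this summary does not spell out the formulation. As a rough illustration only, the sketch below assumes a conventional exponentially decayed time surface, computed separately per polarity and repeated for several decay constants. The function name `multi_channel_time_surface`, the `(x, y, t, polarity)` event layout, the channel ordering, and the example `taus` values are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Assumed event layout for illustration: (x, y, timestamp in seconds, polarity).
Event = Tuple[int, int, float, int]


def multi_channel_time_surface(
    events: Sequence[Event],
    width: int,
    height: int,
    t_ref: float,
    taus: Sequence[float],
) -> List[List[List[float]]]:
    """Return surfaces[c][y][x] with two channels (positive, negative) per tau.

    Each pixel holds exp(-(t_ref - t_last) / tau), where t_last is the time of
    the most recent event of that polarity at that pixel, or 0.0 if none.
    This is a generic time surface, not necessarily the paper's exact MCTS.
    """
    # Latest event timestamp per polarity and pixel; None means "no event yet".
    last: List[List[List[Optional[float]]]] = [
        [[None] * width for _ in range(height)] for _ in range(2)
    ]
    for x, y, t, p in events:
        if t > t_ref:
            continue  # events after the reference time are ignored
        pol = 1 if p > 0 else 0
        prev = last[pol][y][x]
        if prev is None or t > prev:
            last[pol][y][x] = t

    surfaces: List[List[List[float]]] = []
    for tau in taus:  # one pair of channels per time scale
        for pol in (1, 0):  # positive polarity first, then negative
            channel = []
            for y in range(height):
                row = []
                for x in range(width):
                    t_last = last[pol][y][x]
                    if t_last is None:
                        row.append(0.0)
                    else:
                        row.append(math.exp(-(t_ref - t_last) / tau))
                channel.append(row)
            surfaces.append(channel)
    return surfaces


# Tiny usage example on a 2x1 sensor with two assumed time scales.
example_events = [(0, 0, 0.9, 1), (1, 0, 0.5, -1), (0, 0, 0.2, 1)]
example_mcts = multi_channel_time_surface(example_events, 2, 1, 1.0, [0.1, 1.0])
```

A short decay constant keeps only very recent events visible, while a long one retains older structure, which is one plausible way to read "motion across multiple time scales". Stacking the channels gives an image-like tensor that could be fed to a network in place of a gray-scale frame.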