EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

Adam D. Hines, Gokul B. Nair, Nicolás Marticorena, Michael Milford, Tobias Fischer

TL;DR

This work presents EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition, and demonstrates it in a real-world deployment on a robotic platform, performing online localization directly from the event stream of an event camera.

Abstract

Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work achieves state-of-the-art localization performance compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, while remaining fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/
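The global retrieval stage described in the abstract can be illustrated with a minimal sketch. It assumes the backbone has already produced a (num_patches, dim) array of patch embeddings per frame; the function names, the pooling exponent p = 3, and the use of cosine similarity for ranking are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def gem_pool(patch_features, p=3.0, eps=1e-6):
    """Generalized mean (GeM) pooling over the patch axis.

    patch_features: array of shape (num_patches, dim).
    p = 1 gives average pooling; large p approaches max pooling.
    The value p = 3 is an assumed default, not taken from the paper.
    """
    clamped = np.clip(patch_features, eps, None)  # keep values positive
    return np.mean(clamped ** p, axis=0) ** (1.0 / p)


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def top_k_matches(query_desc, reference_descs, k=5):
    """Rank reference descriptors by cosine similarity to the query.

    Returns the indices of the k most similar references and their scores.
    These candidates are what a later re-ranking stage would reorder.
    """
    q = l2_normalize(query_desc)
    refs = l2_normalize(reference_descs)
    sims = refs @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```

A query frame would be pooled with `gem_pool`, compared against the pooled reference descriptors with `top_k_matches`, and the resulting shortlist handed to the local-feature re-ranking step.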

Paper Structure (20 sections, 19 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Schematic overview of EventGeM. Polarity histogram frames are passed through a vision transformer Yang2023 (ViT) and a generalized mean pooling Radenović2018 (GeM) layer for global features and initial place predictions. Multi-channel time surface (MCTS) representations are passed through SuperEvent Burkhardt2025 for local feature detection and 2D-homography-based re-ranking via RANSAC. An optional second re-ranking step uses tencode representations with Depth AnyEvent Bartolomei2025 to produce depth estimates, which further refine the matches using the structural similarity index measure (SSIM).
  • Figure 2: Example event frames and attention maps generated after GeM pooling from the pre-trained ViT backbone Yang2023 for the datasets used in this work: Brisbane-Event-VPR Fischer2020, NSAVP Carmichael2025, and Fast-and-Slow Nair2024. Query and reference matches with the keypoint descriptors used for 2D-homography-based re-ranking with RANSAC. Example query and database matches using EventGeM-D, from depth maps generated by Depth AnyEvent Bartolomei2025 and re-ranked with SSIM after keypoint RANSAC, with the highest similarity score indicating a correct match.
  • Figure 3: Per-query runtime of EventGeM, EventGeM-D, and the baseline methods plotted against the average Recall@1 on the Brisbane-Event-VPR Fischer2020 dataset. Values are the average recall and runtime with Sunset2 as the reference against Sunset1, Morning, and Daytime as queries.
  • Figure 4: Online deployment of EventGeM for real-time localization on a robotic platform. We used an Agile Scout 4-wheeled robot fitted with a DAVIS346 DVS and a Jetson Orin AGX running EventGeM. Events were collected over a 50 ms time window and processed by EventGeM, including the generation of the different representations for the ECDPT+GeM Yang2023 backbone and the SuperEvent Burkhardt2025 keypoint detector. The robot was teleoperated around an indoor environment to capture a reference dataset and a query traverse following the same path. Our results show strong alignment with the ground-truth position, achieving an R@1 above 88% while processing queries at an average rate of 24 Hz.
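
The polarity histogram frames mentioned in the captions above can be sketched as a per-pixel event count over a fixed time window. The following is a minimal illustration under stated assumptions: events arrive as separate arrays of pixel coordinates, timestamps in seconds, and polarities; the default 346 x 260 frame size matches the DAVIS346 sensor resolution; and the function name and two-channel layout are hypothetical rather than taken from the EventGeM code.

```python
import numpy as np


def polarity_histogram(x, y, t, p, t_start, window_s=0.05,
                       height=260, width=346):
    """Count events per pixel and polarity in [t_start, t_start + window_s).

    x, y: integer pixel coordinates of each event.
    t:    event timestamps in seconds (assumed unit).
    p:    polarities; any value > 0 is treated as an ON event, so both
          {0, 1} and {-1, +1} encodings work.
    Returns an int32 array of shape (2, height, width):
    channel 0 holds OFF counts, channel 1 holds ON counts.
    """
    x = np.asarray(x).astype(np.intp)
    y = np.asarray(y).astype(np.intp)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p)

    # Keep only the events that fall inside the accumulation window.
    mask = (t >= t_start) & (t < t_start + window_s)
    channel = (p[mask] > 0).astype(np.intp)

    hist = np.zeros((2, height, width), dtype=np.int32)
    # np.add.at accumulates correctly when the same pixel appears repeatedly.
    np.add.at(hist, (channel, y[mask], x[mask]), 1)
    return hist
```

With the default 50 ms window, calling this function once per window on a live event stream would yield one histogram frame per query; any normalization or resizing applied before the backbone is not covered here.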