Ensemble-Based Event Camera Place Recognition Under Varying Illumination

Therese Joseph, Tobias Fischer, Michael Milford

TL;DR

The paper tackles robust visual place recognition with event cameras under varying illumination. It introduces an ensemble-based pipeline that fuses results across multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions, using late score fusion and a modified SeqSLAM approach. Key contributions include a 57% relative improvement in Recall@1 across day–night transitions, a comprehensive analysis of binning, polarity, reconstructions, and feature extractors, and an extended sequence matching method with dynamic history. The findings demonstrate the value of combining complementary representations for robustness and provide a framework for future work on event-native features and real-world deployment.

Abstract

Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, developing robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which only utilise temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (e.g., afternoon, sunset, night), achieving a 57% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, polarity handling, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.
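The fusion of sequence-matched results described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the min-max normalisation, the plain averaging of ensemble members, and the unit-slope sequence score with truncated history at the matrix borders are all assumptions made for illustration. Each input is a reference-by-query similarity matrix produced by one combination of reconstruction, feature extractor, and temporal resolution.

```python
import numpy as np

def normalise(sim):
    """Min-max normalise one score matrix to [0, 1] (assumed choice)."""
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo) if hi > lo else np.zeros_like(sim, dtype=float)

def sequence_scores(sim, seq_len):
    """Average similarity along the unit-slope diagonal ending at each cell.

    Cells near the border use the shorter history that is available
    (an assumption for this sketch, not the paper's dynamic-history method).
    """
    n_ref, n_qry = sim.shape
    out = np.empty((n_ref, n_qry), dtype=float)
    for r in range(n_ref):
        for q in range(n_qry):
            k = min(seq_len, r + 1, q + 1)
            idx = np.arange(k)
            out[r, q] = sim[r - idx, q - idx].mean()
    return out

def fuse(sims, seq_len):
    """Late fusion: sequence-match each member, normalise, then average."""
    fused = sum(normalise(sequence_scores(s, seq_len)) for s in sims)
    return fused / len(sims)

def recall_at_1(fused, gt_ref_idx, tolerance=0):
    """Fraction of queries whose top-scoring reference is within the tolerance."""
    pred = fused.argmax(axis=0)  # best reference index per query column
    return float(np.mean(np.abs(pred - gt_ref_idx) <= tolerance))
```

As a usage example, `fuse([sim_a, sim_b], seq_len=10)` returns a single fused matrix, and `recall_at_1(fused, gt, tolerance=t)` scores it against ground-truth reference indices `gt`; the tolerance `t` (in frames) is a hypothetical parameter and would need to match the evaluation protocol actually used.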

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of our ensemble-based event camera place recognition pipeline. (A) Each bin is reconstructed into a 2D frame using one of several methods: event count (with/without polarity), time surface [lagorce_hots_2017], or the learned E2VID model [rebecq_high_2019]. (B) These reconstructed frames are processed by an ensemble of visual place recognition (VPR) feature extractors (NetVLAD [arandjelovic_netvlad_2018], CosPlace [berton_rethinking_2022], MixVPR [ali-bey_mixvpr_2023], MegaLoc [berton_megaloc_2025]) to generate global descriptors. (C) Pairwise descriptor similarities are computed, followed by sequence matching on each resulting similarity matrix. (D) Finally, sequence scores are aggregated across the multiple reconstruction methods, feature extractors, and temporal resolutions to enhance recognition performance.
  • Figure 2: Reconstructed event frames under varying illumination conditions. Column 1 shows the corresponding standard camera image. Columns 2–5 show event-based reconstructions: (2) Two-channel Event Count (polarity-separated), (3) Single-channel Event Count (polarity-combined), (4) Two-channel Time Surface (polarity-separated), and (5) Single-channel E2VID reconstruction (learned method). Rows 1–2 show afternoon and night conditions from a Gen4 Prophesee sensor in the NSAVP dataset [carmichael_dataset_2025], while rows 3–4 show daytime and sunset conditions from a DAVIS 346 sensor in the Brisbane Event dataset [fischer_event-based_2020].
  • Figure 3: Average Recall@1 (AR@1) for each reference–query pair across ensembling strategies and best individual method, shown separately for day and night conditions. Results are averaged over sequence lengths of 10, 20, and 30 at 1 Hz sampling. The combined ensemble aggregates predictions from varied feature extractors, event-to-frame reconstructions and temporal resolutions.
  • Figure 4: Similarity matrices for a sunset–daytime pair with the MegaLoc feature extractor and varied reconstructions. Columns 1 and 2 show results from the E2VID, EventCount, EventCount (no polarity), and time surface reconstructions. Column 3 shows the ensemble of reconstruction methods. Ground-truth, correct, and incorrect matches are annotated.
  • Figure 5: Average Recall@1 versus the number of reference and query frames, evaluated across two binning strategies and varying event slice resolutions. Time-based binning consistently yields the highest recall, with performance improving at higher resolutions (i.e., smaller bin sizes). The total number of frames serves as a proxy for computational cost, as both event-to-frame reconstruction and feature extraction scale linearly with frame count. Consequently, higher-resolution bins incur greater compute cost.
  • ...and 3 more figures
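
The event count reconstruction in Figure 1 (A) and the time-based binning in Figure 5 can be sketched as follows. This is an illustrative sketch only: the function name `event_count_frames`, the array layout of the events (timestamps in seconds, pixel coordinates, polarity), and the output shape are assumptions, not details taken from the paper.

```python
import numpy as np

def event_count_frames(t, x, y, p, height, width, bin_seconds, split_polarity=True):
    """Accumulate events into count frames using fixed-duration time bins.

    Returns an array of shape (n_bins, channels, height, width), with two
    channels (negative, positive) when split_polarity is True, else one.
    """
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    positive = np.asarray(p) > 0  # accepts 0/1 or -1/+1 polarity encodings

    # Time-based binning: every bin spans the same duration, so the number
    # of events per frame varies with scene motion and vehicle speed.
    bin_idx = np.floor((t - t.min()) / bin_seconds).astype(np.int64)
    n_bins = int(bin_idx.max()) + 1

    channels = 2 if split_polarity else 1
    frames = np.zeros((n_bins, channels, height, width), dtype=np.int32)
    ch = positive.astype(np.int64) if split_polarity else np.zeros_like(bin_idx)

    # np.add.at handles repeated indices, so events hitting the same pixel
    # in the same bin are all counted.
    np.add.at(frames, (bin_idx, ch, y, x), 1)
    return frames
```

Smaller values of `bin_seconds` give more frames per traverse, which is the higher temporal resolution (and higher compute cost) that the Figure 5 caption refers to.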