EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

Xiao Wang, Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang, Si-bao Chen, Yaowei Wang, Yonghong Tian

TL;DR

This work addresses the lack of large-scale, semantically rich benchmarks for event stream-based visual place recognition (VPR) by introducing EPRBench, a high-definition dataset with $10K$ event sequences and $65K$ frames (at $1280 \times 720$), plus LLM-generated scene descriptions refined by humans. It also proposes SG-VPR, a semantic-guided multi-modal VPR framework that fuses asynchronous event streams with textual descriptions via text-guided top-$k$ token selection and multi-scale pooling, accompanied by an auxiliary LLM decoder for interpretability. The authors benchmark 15 state-of-the-art VPR methods on EPRBench and NYC-Event-VPR, and report that SG-VPR achieves state-of-the-art performance on EPRBench ($R@1 = 94.3\%$) and strong cross-modal results on NYC-Event-VPR, with improved robustness to viewpoint changes and environmental noise. The work contributes a dataset and an interpretable paradigm for cross-modal event-based localization, aimed at robust navigation in GPS-denied settings.
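The $R@1$ figure quoted above is a Recall@K retrieval metric. As a rough illustration of how such a metric is typically computed, here is a minimal sketch in plain Python. It assumes cosine similarity between global descriptors and a precomputed set of true database matches per query; the function names, the toy descriptors, and the way positives are defined are illustrative assumptions, not the paper's evaluation protocol.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def recall_at_k(queries, database, positives, k=1):
    """Fraction of queries with at least one true match among the top-k results.

    positives[i] is the set of database indices counted as correct for query i
    (assumption: how positives are defined depends on the benchmark).
    """
    hits = 0
    for q_idx, q in enumerate(queries):
        # Rank database entries by descending similarity to the query.
        ranked = sorted(range(len(database)),
                        key=lambda i: cosine(q, database[i]),
                        reverse=True)
        if any(i in positives[q_idx] for i in ranked[:k]):
            hits += 1
    return hits / len(queries)

# Toy descriptors (hypothetical values, for illustration only).
database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
positives = [{0}, {1}, {2}]

r1 = recall_at_k(queries, database, positives, k=1)
r2 = recall_at_k(queries, database, positives, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy setup the third query's nearest neighbour is a wrong place, so it only counts as a hit once the second-ranked result is also considered.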

Abstract

Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
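The abstract names text-guided token selection and multi-scale representation learning but does not give their details here. The following is a minimal sketch of one plausible reading, not the authors' SG-VPR implementation: event tokens are scored by cosine similarity to a single text embedding, the top-$k$ are kept, and the survivors are mean-pooled at several group sizes and concatenated. All names, the scoring rule, and the pooling scheme are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def select_top_k_tokens(event_tokens, text_embedding, k):
    """Keep the k event tokens most similar to the text embedding.

    Assumption: relevance is cosine similarity to one pooled text vector.
    The kept tokens are returned in their original order.
    """
    t = l2_normalize(text_embedding)
    scores = [dot(l2_normalize(tok), t) for tok in event_tokens]
    top = sorted(range(len(event_tokens)),
                 key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)
    return keep, [event_tokens[i] for i in keep]

def mean_pool(tokens):
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

def multi_scale_pool(tokens, scales=(1, 2)):
    """For each scale s, split tokens into s contiguous groups,
    mean-pool each group, and concatenate all pooled vectors."""
    n = len(tokens)
    if n < max(scales):
        raise ValueError("need at least max(scales) tokens")
    out = []
    for s in scales:
        for g in range(s):
            start, end = g * n // s, (g + 1) * n // s
            out.extend(mean_pool(tokens[start:end]))
    return out

# Toy inputs (hypothetical): four 2-D event tokens and one text embedding.
event_tokens = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
text_embedding = [1.0, 0.0]

keep, selected = select_top_k_tokens(event_tokens, text_embedding, k=2)
descriptor = multi_scale_pool(selected, scales=(1, 2))
print(keep, len(descriptor))
```

Sorting the kept indices preserves the tokens' spatial order, which matters if the pooling groups are meant to correspond to contiguous image regions; a learned scoring head or cross-attention could replace the cosine score without changing the overall shape of the sketch.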

Paper Structure (21 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: (a) An illustration of event stream-based VPR; (b) a comparison between existing VPR frameworks and ours.
  • Figure 2: Data samples from our newly proposed EPRBench dataset. We visualize two distinct samples (a, b) for each scene category (Campus, Park, Road). For every sample, three images from different viewpoints are presented, demonstrating the significant viewpoint variations contained in our dataset.
  • Figure 3: The overall pipeline for constructing the specialized scenario description model. We first synthesize Chinese CoT data from paired images, then rigorously translate and refine it into English using DeepSeek to ensure semantic accuracy. Finally, this curated dataset is used to fine-tune the Qwen model, enabling it to perform robust scene reasoning and description.
  • Figure 4: An overview of our proposed Semantic Guided VPR framework for event-based place recognition (SG-VPR). The dual-stream system processes event streams and textual descriptions in parallel, adopting a text-guided top-$k$ token selection strategy to extract robust multi-modal representations, with an auxiliary LLM decoder enhancing feature interpretability for retrieval.
  • Figure 5: Visualization of qualitative results. The leftmost column shows the queries, followed by the retrieval results from our SG-VPR and four baselines. Green/Red borders denote success/failure. Our method achieves correct retrieval across all samples, outperforming competitors that struggle with large viewpoint shifts and environmental noise.
  • ...and 1 more figures