EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition
Xiao Wang, Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang, Si-bao Chen, Yaowei Wang, Yonghong Tian
TL;DR
This work addresses the lack of large-scale, semantically rich benchmarks for event stream-based visual place recognition (VPR) by introducing EPRBench, a high-definition dataset with $10K$ event sequences and $65K$ event frames (at $1280 \times 720$), plus LLM-generated scene descriptions refined by human annotators. It also proposes SG-VPR, a semantic-guided multi-modal VPR framework that fuses asynchronous event streams with textual descriptions via text-guided top-$k$ token selection and multi-scale pooling, accompanied by an auxiliary LLM decoder for interpretability. The authors benchmark 15 state-of-the-art VPR methods on EPRBench and NYC-Event-VPR. SG-VPR achieves the best reported performance on EPRBench ($R@1 = 94.3\%$) and strong cross-modal results on NYC-Event-VPR, with improved robustness to viewpoint changes and environmental noise. The work contributes a new benchmark resource and a scalable, interpretable paradigm for cross-modal event-based localization, with practical relevance to robust navigation in GPS-denied settings.
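For readers unfamiliar with the $R@1$ figure quoted above, the following is a minimal sketch of how Recall@K is commonly computed in place recognition: a query counts as a hit if at least one of its $K$ nearest database entries is a true match. The function names, the cosine-similarity ranking, and the toy descriptors are illustrative assumptions, not the evaluation code used by the authors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_descs, db_descs, positives, k=1):
    """Fraction of queries with at least one true match among the top-k results.

    positives[i] is the set of database indices that count as correct
    places for query i (hypothetical ground-truth format).
    """
    hits = 0
    for q_idx, query in enumerate(query_descs):
        # Rank database entries from most to least similar to the query.
        ranked = sorted(
            range(len(db_descs)),
            key=lambda i: cosine_similarity(query, db_descs[i]),
            reverse=True,
        )
        if any(i in positives[q_idx] for i in ranked[:k]):
            hits += 1
    return hits / len(query_descs)


# Toy data: three database places and three queries (made up for illustration).
db = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
positives = [{0}, {1}, {2}]

r1 = recall_at_k(queries, db, positives, k=1)
r2 = recall_at_k(queries, db, positives, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy case the third query ranks its true match second, so it misses at $K = 1$ but counts as a hit at $K = 2$.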
Abstract
Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: LLMs generate textual scene descriptions from raw event streams, and these descriptions then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning alongside its predictions, enhancing model transparency and explainability. The dataset and source code will be released at https://github.com/Event-AHU/Neuromorphic_ReID.
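The text-guided token selection and multi-scale representation steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' SG-VPR implementation: it assumes event tokens and the text embedding already live in a shared feature space, scores tokens by cosine similarity to the text embedding, and pools the kept tokens by averaging contiguous groups at several scales. All function names and the toy values are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def select_top_k_tokens(event_tokens, text_embedding, k):
    """Keep the k event tokens most similar to the text embedding.

    Assumption: relevance is plain cosine similarity; the paper may use
    a learned attention score instead.
    """
    text_unit = normalize(text_embedding)
    scores = [
        sum(a * b for a, b in zip(normalize(tok), text_unit))
        for tok in event_tokens
    ]
    top = sorted(range(len(event_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore the original (spatial) token order
    return keep, [event_tokens[i] for i in keep]


def mean_pool(tokens):
    """Element-wise average of a non-empty list of token vectors."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def multi_scale_pool(tokens, scales=(1, 2)):
    """Concatenate mean-pooled features over s contiguous groups per scale s.

    Requires len(tokens) >= max(scales) so that no group is empty.
    """
    n = len(tokens)
    descriptor = []
    for s in scales:
        for g in range(s):
            chunk = tokens[g * n // s:(g + 1) * n // s]
            descriptor.extend(mean_pool(chunk))
    return descriptor


# Toy example: four 2-D event tokens and one 2-D text embedding (made up).
event_tokens = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
text_embedding = [1.0, 0.0]

keep, selected = select_top_k_tokens(event_tokens, text_embedding, k=2)
descriptor = multi_scale_pool(selected, scales=(1, 2))
print("kept token indices:", keep)
print("descriptor length:", len(descriptor))
```

The resulting descriptor could then be compared against database descriptors for retrieval. Restoring the original token order after selection is a design choice made here so that the contiguous groups used in pooling keep some spatial meaning.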
