Event-Driven Neuromorphic Vision Enables Energy-Efficient Visual Place Recognition

Geoffroy Keime, Nicolas Cuperlier, Benoit R. Cottereau

Abstract

Reliable visual place recognition (VPR) under dynamic real-world conditions is critical for autonomous robots, yet conventional deep networks remain limited by high computational and energy demands. Inspired by the mammalian navigation system, we introduce SpikeVPR, a bio-inspired and neuromorphic approach combining event-based cameras with spiking neural networks (SNNs) to generate compact, invariant place descriptors from a few exemplars, achieving robust recognition under extreme changes in illumination, viewpoint, and appearance. SpikeVPR is trained end-to-end using surrogate gradient learning and incorporates EventDilation, a novel augmentation strategy that enhances robustness to speed and temporal variations. Evaluated on two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 to 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms. These results demonstrate that spike-based coding offers an efficient pathway toward robust VPR in complex, changing environments.
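The abstract introduces EventDilation by name only. As one hedged reading, the sketch below temporally stretches or compresses an event stream's timestamps to mimic slower or faster traversals; the function name `event_dilation`, the (x, y, t, polarity) array layout, and the toy sensor resolution are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def event_dilation(events: np.ndarray, factor: float) -> np.ndarray:
    """Temporally dilate an event stream by `factor` (assumed layout:
    an (N, 4) array with columns x, y, t, polarity and t in microseconds).
    factor > 1 stretches time (slower apparent motion), factor < 1
    compresses it (faster apparent motion)."""
    dilated = events.astype(np.float64).copy()
    t0 = dilated[:, 2].min()                      # anchor at the first event
    dilated[:, 2] = t0 + (dilated[:, 2] - t0) * factor
    return dilated

# Toy stream of 5 events on a 346x260 sensor, stretched 2x.
rng = np.random.default_rng(0)
events = np.column_stack([
    rng.integers(0, 346, 5),                      # x
    rng.integers(0, 260, 5),                      # y
    np.sort(rng.integers(0, 10_000, 5)),          # t (microseconds)
    rng.integers(0, 2, 5),                        # polarity (0 = off, 1 = on)
]).astype(np.float64)
slow = event_dilation(events, factor=2.0)
```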

Paper Structure

This paper contains 4 sections, 15 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Visual place recognition (VPR) in classical frame-based, biological, and bio-inspired systems. Classical systems (illustrated in blue in the first row) typically rely on RGB images captured at a fixed sampling rate (e.g., 30 or 60 Hz). These images are processed by deep neural networks, such as ResNet or VGG, to extract discriminative descriptors of the different locations (middle panel). The descriptor of a query image is then compared with those stored in memory using a similarity metric to identify the closest match (a hedged sketch of this descriptor-matching step appears after this figure list). Although this approach achieves strong retrieval performance, it remains challenging to deploy on resource-constrained platforms because it depends on tens of millions of real-valued parameters, leading to high computational and memory demands that limit its practicality for portable implementations. In contrast, visual place recognition in biological systems (shown in green in the second row) is far more efficient. The retina primarily transmits sparse information as spikes, which occur mostly when changes in illumination, either increments or decrements, are detected in the visual scene. Because spikes are all-or-none signals, their processing along the visual pathway is highly efficient, and the entire system is estimated to consume only around twenty watts. At the end of this processing, cortical structures such as the entorhinal cortex encode environmental features that support the formation of hippocampal place cells, neurons that represent specific locations in the explored environment. The inset above the rat illustrates the place field of one such neuron. Our proposed approach, SpikeVPR (shown in orange in the last row), draws direct inspiration from biological systems. It employs an event-based camera, which, like the biological retina, detects changes in illumination in near real-time. The resulting ‘on’ and ‘off’ spikes are processed by a spiking neural network (SNN) that, similar to the brain, encodes scene descriptors using only binary values. In our implementation, the SNN is built on a SEW ResNet architecture, and scene descriptors are extracted using Spiking MixVPR (see the Materials and Methods).
  • Figure 2: Illustration of three core challenges in VPR, shown for both event-based and RGB modalities on the Brisbane-Event-VPR Dataset. (a) Occlusion: The same place is captured with and without a significant foreground obstruction (a vehicle windshield pillar), leading to large differences in visual input. (b) Dynamic Objects: A parked truck partially occludes the scene in one traverse, generating spurious high-activity regions in the event frame that are absent in the reference, even though the underlying place remains the same. (c) Perceptual Aliasing: Two structurally similar but geographically distinct locations (1.4 km apart) produce similar appearances, creating a risk of false matches for VPR systems. In event data, red and blue spikes indicate brightness increases and decreases, respectively. RGB frames are shown for illustration only and are not used in our study.
  • Figure 3: Overview of the SpikeVPR architecture and its training procedure, illustrated using the Brisbane VPR dataset. Given the event stream corresponding to the vehicle’s current location (purple), SpikeVPR generates a 4096-dimensional descriptor through a spiking neural network (SNN). The architecture consists of an encoder built from Spike-Element-Wise (SEW) ResNet blocks (shown in the lower part of the middle panel) followed by an aggregation module. The SNN is trained using surrogate gradient learning (SGL) with a contrastive loss. This objective encourages high similarity between the latent representation of the current location and those of positive examples (i.e., places located within 30 meters of the query and sampled across different traversals, in green), while reducing similarity with descriptors corresponding to negative locations (i.e., other places, in pink). A hedged sketch of one common form of this contrastive objective appears after this figure list. RGB images are displayed for visualization purposes only and are not used in the processing pipeline.
  • Figure 4: Illustration from the NSAVP dataset showing how the motion of a vehicle equipped with an event-based camera affects VPR. Each pair of images depicts the same location captured during two different traversals. (a) Stop with moving objects: When the vehicle is stationary, static scene elements generate no events, leaving only moving objects (e.g., passing cars) visible. This causes a near-complete loss of place-descriptive information, with dynamic distractors dominating the event representation. (b) Slowing Down: As the vehicle decelerates, event rates drop, producing sparse and noisy frames with reduced scene coverage compared to normal-speed traverses. In both cases, the event representation of the same place differs markedly between traverses, highlighting the sensitivity of event-based descriptors to vehicle velocity changes and dynamic objects.
  • Figure 5: Recall@N (top row) and Precision (bottom row) curves on the Brisbane VPR dataset. Results obtained with SpikeVPR are shown in orange. For comparison, we include curves estimated using the sum of absolute differences (SAD) method (blue) and principal component analysis (PCA) (green). The values displayed on the left of the first row correspond to Recall@1. The values displayed on the right correspond to Recall@5 (first row) and Precision@100 (second row). (a) Average performance across all traverses (Daytime, Morning, Sunrise, Sunset 2). Shaded regions indicate the standard error of the mean. (b) Performance measured separately for each tested traverse. A hedged sketch of the Recall@N computation appears after this figure list.
  • ...and 5 more figures
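
Illustrative Sketches

Figure 1 describes the retrieval step shared by all three pipelines: the descriptor of a query is compared against stored reference descriptors with a similarity metric, and the closest match is returned. Below is a minimal sketch of that step using cosine similarity; the function name, the toy data, and the nearest-neighbor formulation are illustrative assumptions, with only the 4096-dimensional descriptor size taken from Figure 3.

```python
import numpy as np

def match_place(query: np.ndarray, references: np.ndarray) -> tuple[int, float]:
    """Return the index of the best-matching reference descriptor
    and its cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = r @ q                          # cosine similarity to every stored place
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy example: 100 stored places with 4096-D descriptors (size from Figure 3).
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 4096))
noisy_query = db[42] + 0.1 * rng.standard_normal(4096)
idx, score = match_place(noisy_query, db)   # expect idx == 42
```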
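Figure 3 states that SpikeVPR is trained end-to-end with a contrastive loss that pulls the query descriptor toward positives (the same place seen in other traversals, within 30 meters) and pushes it away from negatives (other places). The paper's exact loss is not given here; the sketch below uses a standard triplet-margin formulation as one common instantiation, and the margin value is arbitrary.

```python
import torch
import torch.nn.functional as F

def triplet_place_loss(query, positive, negative, margin: float = 0.5):
    """One common contrastive objective for VPR: make the query descriptor
    closer (in cosine distance) to a positive than to a negative by `margin`.
    The margin value is an illustrative choice, not taken from the paper."""
    q, p, n = (F.normalize(x, dim=-1) for x in (query, positive, negative))
    d_pos = 1.0 - (q * p).sum(dim=-1)     # cosine distance to positive
    d_neg = 1.0 - (q * n).sum(dim=-1)     # cosine distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch of 8 (query, positive, negative) descriptor triplets.
q, p, n = (torch.randn(8, 4096) for _ in range(3))
loss = triplet_place_loss(q, p, n)
```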
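Figure 5 reports Recall@N: the fraction of queries whose correct place appears among the N most similar references. A minimal sketch of that metric follows; the convention of exactly one ground-truth reference index per query is an assumption for illustration (VPR evaluations often instead accept any reference within a distance tolerance).

```python
import numpy as np

def recall_at_n(sim_matrix: np.ndarray, gt_index: np.ndarray, n: int) -> float:
    """Fraction of queries whose ground-truth reference is among the n
    most similar references. `sim_matrix` is (num_queries, num_references);
    `gt_index[i]` is the correct reference index for query i."""
    top_n = np.argsort(-sim_matrix, axis=1)[:, :n]   # n best references per query
    hits = (top_n == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy check: 50 queries vs. 100 references, with the correct match boosted.
rng = np.random.default_rng(0)
sims = rng.random((50, 100))
gt = rng.integers(0, 100, 50)
sims[np.arange(50), gt] += 1.0        # ensure the ground truth scores highest
print(recall_at_n(sims, gt, n=1))     # 1.0 on this constructed toy data
```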