SFOD: Spiking Fusion Object Detector

Yimeng Fan, Wei Zhang, Changsong Liu, Mingyang Li, Wenrui Lu

TL;DR

The paper tackles object detection with event cameras by using Spiking Neural Networks (SNNs) to exploit the high temporal resolution and sparsity of event data. It introduces the Spiking Fusion Module, which fuses multi-scale SNN feature maps and is integrated with a Spiking DenseNet backbone and an SSD head; representations are further refined by the Spiking Pyramid Extraction Submodule (SPES). Through a systematic analysis of spiking decoding strategies and loss functions, it shows that Spiking Rate Decoding with Mean Squared Error loss yields state-of-the-art SNN classification accuracy on NCAR ($93.7\%$), and that SFOD achieves a leading mAP of $32.1\%$ on GEN1 among SNN-based detectors, often surpassing non-SNN baselines with comparable parameter counts while remaining energy efficient. The work substantiates the potential of SNNs for event-based detection and provides practical design choices, supplementary derivations, and code for replication.

Abstract

Event cameras, characterized by high temporal resolution, high dynamic range, low power consumption, and high pixel bandwidth, offer unique capabilities for object detection in specialized contexts. Despite these advantages, the inherent sparsity and asynchrony of event data pose challenges to existing object detection algorithms. Spiking Neural Networks (SNNs), inspired by the way the human brain codes and processes information, offer a potential solution to these difficulties. However, their performance in object detection using event cameras is limited in current implementations. In this paper, we propose the Spiking Fusion Object Detector (SFOD), a simple and efficient approach to SNN-based object detection. Specifically, we design a Spiking Fusion Module, achieving the first-time fusion of feature maps from different scales in SNNs applied to event cameras. Additionally, through integrating our analysis and experiments conducted during the pretraining of the backbone network on the NCAR dataset, we delve deeply into the impact of spiking decoding strategies and loss functions on model performance. Thereby, we establish state-of-the-art classification results based on SNNs, achieving 93.7\% accuracy on the NCAR dataset. Experimental results on the GEN1 detection dataset demonstrate that the SFOD achieves a state-of-the-art mAP of 32.1\%, outperforming existing SNN-based approaches. Our research not only underscores the potential of SNNs in object detection with event cameras but also propels the advancement of SNNs. Code is available at https://github.com/yimeng-fan/SFOD.
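
To make the pairing of rate decoding and Mean Squared Error loss mentioned above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the output layer emits one binary spike per class at each timestep, that the firing rate is the plain average of those spikes over time, and that the target is a one-hot vector. The function names `spiking_rate_decode` and `mse_loss` and the toy spike train are hypothetical.

```python
def spiking_rate_decode(spikes):
    """Average binary output spikes over time (assumed form of rate decoding).

    spikes: list of T timesteps, each a list of C values in {0, 1}.
    Returns a list of C firing rates in [0, 1].
    """
    num_steps = len(spikes)
    num_classes = len(spikes[0])
    return [sum(step[c] for step in spikes) / num_steps
            for c in range(num_classes)]


def mse_loss(rates, label):
    """Mean squared error between firing rates and a one-hot target."""
    target = [1.0 if c == label else 0.0 for c in range(len(rates))]
    return sum((r - t) ** 2 for r, t in zip(rates, target)) / len(rates)


# Toy example: T = 4 timesteps, C = 2 output neurons (illustrative values only).
spikes = [[1, 0], [1, 0], [0, 1], [1, 0]]
rates = spiking_rate_decode(spikes)   # [0.75, 0.25]
loss = mse_loss(rates, label=0)       # 0.0625
predicted = max(range(len(rates)), key=lambda c: rates[c])  # class 0
print(rates, loss, predicted)
```

In this sketch the predicted class is simply the output neuron with the highest firing rate, and the loss pulls the correct neuron's rate towards 1 and the others towards 0. The paper's actual decoding and loss code may differ in detail, for example in batching or in how surrogate gradients are applied during training.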

Paper Structure (20 sections, 17 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Detection performance vs firing rate of our SFOD on the GEN1 dataset. The areas of the circles correspond to the model size.
  • Figure 2: The architecture of SFOD. The Spiking Fusion Module is highlighted in the dotted area of the figure. In the fusion of layer four, Extra Block1 and Deconv Block4 are introduced and connected with the remainder of the network through the dotted lines.
  • Figure 3: The architectures of SPES. The blue block corresponds to the Pyramid Block in Figure 2.
  • Figure 4: Inference results of the model on the GEN1 dataset. The figure illustrates the detection capabilities of the models across specific scenarios: The first column demonstrates detection of overlapping cars; the second showcases non-overlapping detection; the third presents detection in sparse data contexts; the fourth reveals performance in multi-category scenes; and the fifth focuses on individual person target detection.
  • Figure A: More visual comparison results on the GEN1 dataset.