Spiking CenterNet: A Distillation-boosted Spiking Neural Network for Object Detection

Lennard Bodden, Franziska Schwaiger, Duc Bach Ha, Lars Kreuzberg, Sven Behnke

TL;DR

Spiking CenterNet tackles energy-efficient object detection on event data by introducing a fully spiking adaptation of CenterNet paired with an M2U-Net decoder. The method combines simple, binary-spiking building blocks, a train-from-scratch approach, and knowledge distillation from a non-spiking teacher, achieving competitive mAP on Prophesee GEN1 while reducing energy per inference. The study provides an energy model for SNNs and demonstrates strong robustness via time-step ablations, showing that multiple time steps can be downsampled via temporal averaging without sacrificing performance. Overall, the work delivers a practical, distillation-boosted spiking detector with clear potential for neuromorphic, edge deployments and future extensions to RGB data, 3D bounding boxes, and pose estimation.

Abstract

In the era of AI at the edge, self-driving cars, and climate change, the need for energy-efficient, small, embedded AI is growing. Spiking Neural Networks (SNNs) are a promising approach to address this challenge, with their event-driven information flow and sparse activations. We propose Spiking CenterNet for object detection on event data. It combines an SNN CenterNet adaptation with an efficient M2U-Net-based decoder. Our model significantly outperforms comparable previous work on Prophesee's challenging GEN1 Automotive Detection Dataset while using less than half the energy. Distilling the knowledge of a non-spiking teacher into our SNN further increases performance. To the best of our knowledge, our work is the first approach that takes advantage of knowledge distillation in the field of spiking object detection.

Paper Structure (19 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of our spiking object detection model. We combine a ResNet-18 encoder with M2U-Net-based decoding (Laibacher et al., 2019) that feeds into CenterNet-based heads (Zhou et al., 2019). We remove all residual connections and replace all activation functions with spiking neurons. Postprocessing calculates bounding boxes from the head outputs.
  • Figure 2: Differences between the original Inverted Residual block of M2U-Net (Laibacher et al., 2019) and our spiking adaptation, which drops the non-binary residual connection and moves the activation function from the depth-wise to the point-wise linear block. These two blocks can be merged during inference, making the entire block fully spiking.
  • Figure 3: Predictions of our best model (bottom) and ground truth (top) for selected scenes of the dataset. The pixel colors indicate the two micro time bins, each with two polarities of brightness change, resulting in four input channels (cf. the data section). Note that targets might be invisible if there is no camera or object motion.
  • Figure 4: Impact of the number of time steps in evaluation with a fixed (a) and a variable (b) time window for sampling events. Shown is the mAP of our best models on the GEN1 dataset (de Tournemire et al., 2020).
  • Figure 5: Output of the heatmap head (see Fig. 1), averaged over time steps, for the three evaluated models. Knowledge distillation from the non-spiking teacher to the SNN results in a less sparse, but smoother and ultimately better heatmap.
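
The four-channel input mentioned in the Figure 3 caption (two micro time bins, each with two polarities) can be illustrated with a small sketch. This is not the authors' preprocessing code: the function name `events_to_frame`, the `(timestamp, x, y, polarity)` event layout, the use of event counts rather than binary values, and the bin-major channel order are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (timestamp, x, y, polarity), polarity in {0, 1}.
Event = Tuple[float, int, int, int]


def events_to_frame(events: List[Event], t_start: float, t_end: float,
                    height: int, width: int,
                    micro_bins: int = 2) -> List[List[List[int]]]:
    """Accumulate the events of one time step into micro_bins * 2 count channels."""
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    channels = micro_bins * 2  # two polarities per micro time bin
    frame = [[[0] * width for _ in range(height)] for _ in range(channels)]
    bin_len = (t_end - t_start) / micro_bins
    for t, x, y, p in events:
        if p not in (0, 1):
            raise ValueError("polarity must be 0 or 1")
        # Skip events outside the time window or outside the sensor area.
        if not (t_start <= t < t_end and 0 <= x < width and 0 <= y < height):
            continue
        # Index of the micro time bin, clamped against rounding at the edge.
        b = min(int((t - t_start) / bin_len), micro_bins - 1)
        # Assumed channel order: bin-major, polarity-minor.
        frame[b * 2 + p][y][x] += 1
    return frame


# Usage: four events on a 1x2 sensor; the last one lies outside the window.
events = [(0.1, 0, 0, 1), (0.6, 1, 0, 0), (0.7, 1, 0, 0), (1.5, 0, 0, 1)]
frame = events_to_frame(events, 0.0, 1.0, height=1, width=2)
```

With the default of two micro bins this yields four channels per time step; a sequence of such frames, one per time step, would then form the temporal input to a spiking network.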