Accurate and Efficient Event-based Semantic Segmentation Using Adaptive Spiking Encoder-Decoder Network

Rui Zhang; Luziwei Leng; Kaiwei Che; Hu Zhang; Jie Cheng; Qinghai Guo; Jiangxing Liao; Ran Cheng

TL;DR

This work tackles the challenge of applying spiking neural networks (SNNs) to dense, event-based semantic segmentation. It introduces SpikingEDN, a spiking encoder-decoder that uses AiLIF-based adaptive-threshold encoding in the first layer and a dual-path SSAM module to enhance sparse event representations while remaining compatible with multiply-free inference. An architecture-search strategy refines the encoder design, and the SSAM module enables effective fusion of event streams with grayscale or RGB frames. SpikingEDN achieves an MIoU of 72.57% on DDD17 and 58.32% on DSEC-Semantic, with substantially lower energy consumption than comparable ANNs. The results demonstrate the untapped potential of SNNs for high-level vision tasks on neuromorphic-friendly hardware and point to a practical path toward energy-efficient edge deployment. The authors state that the source code will be made publicly available.

Abstract

Spiking neural networks (SNNs), known for their low-power, event-driven computation and intrinsic temporal dynamics, are emerging as promising solutions for processing dynamic, asynchronous signals from event-based sensors. Despite their potential, SNNs face challenges in training and architectural design, and as a result they lag behind artificial neural networks (ANNs) on challenging event-based dense prediction tasks. In this work, we develop an efficient spiking encoder-decoder network (SpikingEDN) for large-scale event-based semantic segmentation tasks. To enhance the learning efficiency from dynamic event streams, we harness an adaptive threshold, which improves network accuracy, sparsity, and robustness in streaming inference. Moreover, we develop a dual-path Spiking Spatially-Adaptive Modulation (SSAM) module, which is specifically tailored to enhance the representation of sparse events and multi-modal inputs, thereby considerably improving network performance. Our SpikingEDN attains a mean intersection over union (MIoU) of 72.57% on the DDD17 dataset and 58.32% on the larger DSEC-Semantic dataset, results that are competitive with state-of-the-art ANNs while requiring substantially fewer computational resources. Our results shed light on the untapped potential of SNNs in event-based vision applications. The source code will be made publicly available.
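For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean intersection over union is commonly computed from a confusion matrix. It is not taken from the paper's code: the function names `confusion_matrix` and `mean_iou`, the handling of out-of-range labels as "ignore" pixels, and the choice to average only over classes that actually occur are all assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Count (target, pred) label pairs over all pixels.

    Assumption: target labels outside [0, num_classes) mark ignored pixels,
    and predictions are always valid class indices.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = (target >= 0) & (target < num_classes)
    # Encode each (target, pred) pair as a single index, then count.
    idx = target[valid] * num_classes + pred[valid]
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU = TP / (TP + FP + FN) over occurring classes."""
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm).astype(float)
    # Row sums are ground-truth counts, column sums are prediction counts.
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    present = union > 0  # skip classes absent from both pred and target
    return float((tp[present] / union[present]).mean())

# Toy 2x2 label maps with two classes (hypothetical data).
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
print(f"MIoU = {score:.4f}")
```

In the toy example, class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12. Benchmark protocols differ on details such as ignore labels and per-image versus dataset-level accumulation, so reported numbers should be compared only under the same protocol.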

Paper Structure (25 sections, 11 equations, 8 figures, 9 tables)

Introduction
Background
Event-based Semantic Segmentation (EbSS)
SNNs for Dense Prediction
Adaptive Threshold Neuron
Methodology
Event Encoding with Adaptive Threshold Neuron
SSAM Modulation
Architecture of SpikingEDN
Experiments
Experiment on DDD17
Input Representation and Streaming Inference
Architecture Search and Retraining
Results
Experiment on DSEC-Semantic
...and 10 more sections

Figures (8)

Figure 1: Implementation of Spiking Spatially-Adaptive Modulation (SSAM) module. The SSAM module employs a dual-path SNN following MFI to enhance event representation. The augmented input may be original input events, images, or high-quality RGB images. Key components include: Conv (2D convolution operation), BN (batch normalization), Spike (spiking activation), Parallel Conv (multiple parallel dilated convolutions with differing dilation rates), Concat (concatenation operation), and Element-wise sum (addition operation performed between feature maps of the upper and lower paths).
Figure 2: The overall framework of our SpikingEDN. Top left: The encoder comprises six layers, involving downsampling and upsampling of the feature map. The black arrow denotes information transmission between layers, involving changes in feature map resolution. Two stem layers, represented by yellow shades, precede the encoder and serve the purpose of channel adaptation and early-stage feature extraction. DS denotes the downsampling rate. Bottom left: The detailed cell structure. The cell forms a directed acyclic graph across three layers, with each layer consisting of three nodes. Each layer receives spike inputs from the previous two layers. Within each layer, the nodes merge inputs from previous layers, and their outputs are concatenated to form the output of the layer. The red arrow represents the layer-to-node operation (5 $\times$ 5 conv). Concat denotes concatenation. Middle: The spiking Atrous Spatial Pyramid Pooling (ASPP) layer extracts multi-scale features from the encoder and feeds them into the decoder. The spiking ASPP includes four layers each with a 1 $\times$ 1 convolution and three dilated 3 $\times$ 3 convolutions, followed by BNs and spiking activations. The fifth layer features additional pooling before and upsampling after these operations. Outputs from all layers are concatenated and fed into the decoder. Right: The decoder comprises a sequence of spiking convolution and BN layers. Finally, an average upsampling layer is employed to refine boundary information and produce the final predicted semantic segmentation map.
Figure 3: Qualitative comparison on the DDD17 dataset. Red boxes in the images highlight areas where our SpikingEDN's predictions align more closely with the ground truth labels. Column (a) visualizes event data processed using the SBT method, while column (b) shows corresponding original grayscale images. Columns (c) and (d) compare the predictions of EV-SegNet and our SpikingEDN using only event data, respectively. Columns (e) and (f) display results from semantic segmentation using both images and events. The final column (g) contains the ground truth labels. Images from EV-SegNet are taken from their paper, whereas our results are based on the SSAM module.
Figure 4: Qualitative comparison on the DSEC-Semantic dataset. Our SpikingEDN, which combines events and frames as input, captures certain details (marked areas) more accurately than other methods. The red boxes in the images denote areas where SpikingEDN's predictions closely match the ground truth labels. Columns (a) and (b) show visualizations of event data and RGB images, respectively. Columns (c) and (d) show the results of the ESS method with purely event-based input and with combined event and image inputs, respectively. Columns (e) and (f) compare the predictions of SpikingEDN using solely event data and a combination of RGB images and events (with the SSAM module), respectively. The last column (g) contains the ground truth labels.
Figure 5: (a) Comparison of network performance on the DDD17 dataset when using LIF neurons, AiLIF neurons in the first layer only, or AiLIF neurons throughout the network, over a range of threshold values. Each configuration is trained with two different random seeds. (b) Distribution of spiking rates in the first layer. The horizontal axis represents the spiking rate, while the vertical axis denotes the number of neurons within a given firing-rate range, on a base-10 logarithmic scale. (c) Illustration of event density and activation of AiLIF and LIF neurons in the first layer over a short period of inference. The activation of the AiLIF neuron (base threshold 0.3) generally lies between those of the two LIF neurons with boundary thresholds. Note that its activation surpasses that of the LIF neuron with a 0.3 threshold at the first peak of event density, revealing the intricacy of dynamic modulation.
...and 3 more figures
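
To make the contrast between fixed-threshold and adaptive-threshold neurons in the Figure 5 caption more concrete, here is a minimal sketch of a generic discrete-time LIF neuron whose threshold rises after each spike. This is not the paper's AiLIF formulation, which is not given in this summary. The function `simulate_lif`, the hard reset, and the parameters `decay`, `adapt_gain`, and `adapt_decay` are hypothetical choices for illustration, and this simple variant can only raise the threshold above its base value, unlike the richer modulation the caption describes.

```python
import numpy as np

def simulate_lif(inputs, base_threshold=0.3, decay=0.5,
                 adapt_gain=0.0, adapt_decay=0.9):
    """Simulate one leaky integrate-and-fire neuron over a 1-D input sequence.

    With adapt_gain == 0 the threshold is fixed (plain LIF). With
    adapt_gain > 0 each spike raises the threshold, which then relaxes
    back toward base_threshold (a generic adaptive-threshold LIF).
    All parameter names and default values are illustrative assumptions.
    """
    u = 0.0  # membrane potential
    a = 0.0  # adaptation trace driven by the neuron's own spikes
    spikes, thresholds = [], []
    for x in inputs:
        u = decay * u + x                        # leaky integration
        threshold = base_threshold + adapt_gain * a
        s = 1.0 if u >= threshold else 0.0       # fire when threshold is reached
        u = u * (1.0 - s)                        # hard reset after a spike
        a = adapt_decay * a + s                  # update adaptation trace
        spikes.append(s)
        thresholds.append(threshold)
    return np.array(spikes), np.array(thresholds)

# Constant drive as a stand-in for a dense burst of events (hypothetical data).
drive = np.full(50, 0.4)
fixed_spikes, _ = simulate_lif(drive, adapt_gain=0.0)
adaptive_spikes, adaptive_thresholds = simulate_lif(drive, adapt_gain=0.2)
print("fixed-threshold spiking rate:   ", fixed_spikes.mean())
print("adaptive-threshold spiking rate:", adaptive_spikes.mean())
```

Under this constant drive the fixed-threshold neuron fires at every step, while the adaptive neuron fires less often because its threshold climbs with recent activity. That is one simple way an adaptive threshold can increase sparsity under dense input.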
