Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation

Zhenxin Lei, Man Yao, Jiakui Hu, Xinhao Luo, Yanye Lu, Bo Xu, Guoqi Li

TL;DR

Spike2Former tackles the core difficulty of applying Spiking Neural Networks to dense image segmentation by identifying the modules that cause information loss and introducing spike-driven transformer components along with a Normalized Integer LIF (NI-LIF) neuron. The method combines a Spike-driven Deformable Transformer Encoder, a Spike-driven Transformer Decoder, Spike-driven Mask Embedding, and NI-LIF to preserve information, stabilize training, and enable energy-efficient spike-driven inference. Empirically, it achieves state-of-the-art SNN segmentation on ADE20K, VOC2012, and CityScapes with substantial mIoU gains and major energy savings compared to prior SNNs, while approaching ANN performance. The work demonstrates the viability of complex, transformer-based architectures for dense prediction with SNNs and provides practical design patterns for minimizing information loss in spike-based models.

Abstract

Spiking Neural Networks (SNNs) have a low-power advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose the Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state-of-the-art for SNNs in various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0× efficiency on ADE20K, +14.3% mIoU and 5.2× efficiency on VOC2012, and +9.1% mIoU and 6.6× efficiency on CityScapes.

Paper Structure

This paper contains 19 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (A) The architecture of Spike2Former. We propose Spike-Driven Mask Embedding (SDME) and a Spike-Driven Deformable Transformer Encoder (SDTE) to bring transformer-based methods to SNNs. (B) Micro design of the SDTE, including Spike-Driven Deformable Attention (SDDA) and Spike Separable Convolution (ESC). (C) Comparison between NI-LIF and I-LIF. Integer activation results in information loss, especially in cross-model interaction. NI-LIF normalizes the integer activations during training to preserve information and converts them to spikes during inference, using only sparse additions.
  • Figure 2: Illustration of offset-sampling operation in Spike-Driven Deformable Attention (SDDA). While directly spiking the query (Sampled Spike Query) leads to information loss during attention, spiking the attention weights (Spike Attention Weight) effectively preserves this crucial query information.
  • Figure 3: Segment patterns represented by queries. For each query, we average its corresponding segment predictions over the whole ADE20K validation set. All predictions are resized to a resolution of $200 \times 200$ for illustration purposes.
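
The Figure 1(C) caption says NI-LIF normalizes integer activations during training and converts them to spikes at inference. The sketch below is one way to picture that idea, not the authors' implementation: this section does not give the exact formulation, so the round-and-clip rule, the maximum integer `d`, and the function names `ni_lif_activation` and `expand_to_spikes` are all assumptions made for illustration. It checks that a linear layer applied to the normalized integers gives the same result as accumulating weights rescaled by `1/d` over binary spikes.

```python
import numpy as np

def ni_lif_activation(u, d=4):
    """Assumed training-time form: round and clip to integers in [0, d], then divide by d."""
    k = np.clip(np.round(u), 0, d)
    return k / d

def expand_to_spikes(u, d=4):
    """Assumed inference-time form: unroll each integer k into k binary spikes over d virtual steps."""
    k = np.clip(np.round(u), 0, d).astype(int)
    # steps has shape (d, 1, ..., 1) so it broadcasts against k
    steps = np.arange(d).reshape((d,) + (1,) * k.ndim)
    return (steps < k).astype(np.float64)  # shape (d, *u.shape), values in {0, 1}

d = 4
u = np.array([-0.7, 0.4, 1.6, 2.6, 9.0])      # toy membrane potentials
w = np.array([[0.5, -1.0, 2.0, 0.25, 1.5]])   # toy weights of a following linear layer

normalized = ni_lif_activation(u, d)           # values in {0, 1/d, ..., 1}
spikes = expand_to_spikes(u, d)                # binary spike train

# Dense path: multiply weights by the normalized activations.
dense_out = w @ normalized
# Spike path: fold 1/d into the weights; each binary spike then only selects weights to add.
spike_out = sum((w / d) @ spikes[t] for t in range(d))

print(normalized, dense_out, spike_out)
```

Under these assumptions the two paths agree, which is the sense in which a normalized integer activation can be replaced by sparse additions at inference time.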