
TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks

Rui-Jie Zhu, Malu Zhang, Qihang Zhao, Haoyu Deng, Yule Duan, Liang-Jian Deng

TL;DR

TCJA-SNN introduces a Temporal-Channel Joint Attention mechanism for spiking neural networks that jointly models temporal and channel information using efficient 1-D convolutions and a Cross Convolutional Fusion (CCF) layer. The framework compresses spike streams into an average matrix, applies temporal-wise and channel-wise local attention, and fuses the two with CCF to produce salient features with a reduced parameter count. It achieves state-of-the-art accuracy on static and neuromorphic datasets and demonstrates competitive performance in a fully spiking variational autoencoder for image generation, highlighting energy-efficient attention for SNNs. The approach is plug-and-play, enabling improvements across classification and generation tasks while maintaining hardware-friendly, low-precision spiking dynamics.

Abstract

Spiking Neural Networks (SNNs) are attracting widespread interest due to their biological plausibility, energy efficiency, and powerful spatio-temporal information representation ability. Given the critical role of attention mechanisms in enhancing neural network performance, integrating SNNs with attention mechanisms has the potential to deliver energy-efficient and high-performance computing paradigms. We present a novel Temporal-Channel Joint Attention mechanism for SNNs, referred to as TCJA-SNN. The proposed TCJA-SNN framework can effectively assess the significance of spike sequences from both spatial and temporal dimensions. More specifically, our main technical contributions are: 1) We employ a squeeze operation to compress the spike stream into an average matrix. Then, we leverage two local attention mechanisms based on efficient 1-D convolutions to extract features at the temporal and channel levels independently. 2) We introduce the Cross Convolutional Fusion (CCF) layer as a novel approach to model the inter-dependencies between the temporal and channel scopes. This layer breaks the independence of these two dimensions and enables interaction between their features. Experimental results demonstrate that the proposed TCJA-SNN outperforms the state of the art (SOTA) by up to 15.7% in accuracy on standard static and neuromorphic datasets, including Fashion-MNIST, CIFAR10-DVS, N-Caltech 101, and DVS128 Gesture. Furthermore, we apply the TCJA-SNN framework to image generation tasks by leveraging a variational autoencoder. To the best of our knowledge, this study is the first in which an SNN-attention mechanism has been employed for both image classification and generation tasks. Notably, our approach achieves SOTA performance in both domains. Code is available at https://github.com/ridgerchu/TCJA.
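The pipeline described in the abstract (squeeze, two 1-D convolutional local attentions, then fusion) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `conv1d_same` and `tcja`, the `(T, C, H, W)` tensor layout, the kernel size of 3, the zero padding, and in particular the choice of CCF as a sigmoid over the element-wise product of the two attention maps are all assumptions made for this example. The released repository should be treated as the reference.

```python
import numpy as np


def conv1d_same(x, weight):
    """Multi-channel 1-D cross-correlation with zero 'same' padding.

    x:      (in_channels, length)
    weight: (out_channels, in_channels, kernel_size), kernel_size odd
    returns (out_channels, length)
    """
    out_ch, in_ch, k = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero-pad the sliding axis only
    length = x.shape[1]
    out = np.zeros((out_ch, length))
    for j in range(length):
        window = xp[:, j:j + k]  # (in_channels, kernel_size)
        out[:, j] = np.tensordot(weight, window, axes=([1, 2], [0, 1]))
    return out


def tcja(spikes, w_t, w_c):
    """Hypothetical TCJA forward pass on a spike stream of shape (T, C, H, W).

    w_t: (C, C, k) weights of the temporal-wise local attention (TLA)
    w_c: (T, T, k) weights of the channel-wise local attention (CLA)
    """
    # Squeeze: average over the spatial dimensions to get Z with shape (C, T).
    z = spikes.mean(axis=(2, 3)).T
    # TLA: channels act as convolution channels, the kernel slides along time.
    t_att = conv1d_same(z, w_t)            # (C, T)
    # CLA: time steps act as convolution channels, the kernel slides along channels.
    c_att = conv1d_same(z.T, w_c).T        # (C, T)
    # CCF (assumed form): sigmoid of the element-wise product of both maps.
    f = 1.0 / (1.0 + np.exp(-(t_att * c_att)))
    # Re-weight every (time step, channel) slice of the input by its score.
    return spikes * f.T[:, :, None, None], f


rng = np.random.default_rng(0)
T, C, H, W = 5, 6, 8, 8
spikes = (rng.random((T, C, H, W)) < 0.2).astype(np.float64)
w_t = rng.normal(scale=0.1, size=(C, C, 3))
w_c = rng.normal(scale=0.1, size=(T, T, 3))
out, f = tcja(spikes, w_t, w_c)
print(out.shape, f.shape)
```

Because each entry of the fused map multiplies a value that depends on a temporal neighborhood by one that depends on a channel neighborhood, every score mixes information from both axes, which is the interaction the CCF layer is meant to provide.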

Paper Structure

This paper contains 33 sections, 12 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: How our Temporal-Channel Joint Attention differs from existing temporal-wise attention [yao2021temporal], which estimates the saliency of each time step with a squeeze-and-excitation module. $T$ denotes the time step, $C$ denotes the channel, and $H, W$ denote the spatial resolution. By using two separate 1-D convolutional layers and the Cross Convolutional Fusion (CCF) operation, our Temporal-Channel Joint Attention establishes the association between time steps and channels.
  • Figure 2: Correlation between neighboring time steps and channels. The top row shows input frames selected from the DVS128 Gesture dataset. Each image in the nine-panel grid of the bottom row is a channel output from the first 2-D convolutional layer. Channels at different time steps are clearly correlated, which motivates us to merge the temporal and channel information.
  • Figure 3: Parameter growth of a Fully-Connected (FC) layer versus the TCJA layer when the channel size is $C = 64$.
  • Figure 4: The framework of an SNN with the TCJA module. In SNNs, information is transmitted as spike sequences that span both temporal and spatial dimensions. Along the temporal dimension, each thresholded spiking neuron propagates its membrane potential ($\boldsymbol{V}$) and spikes ($\boldsymbol{S}$) forward as in Eq. \ref{eq:SNN layer}, and is trained by backpropagation with a surrogate function. Along the spatial dimension, data flows between layers as in an ANN. The TCJA module first compresses information along both the temporal and spatial dimensions, then applies TLA and CLA to establish relationships in the temporal and channel dimensions, and blends them with the CCF layer.
  • Figure 5: Illustration of the proposed TCJA. Given an average matrix $\mathcal{Z} \in \mathbb{R}^{6\times 5}$, the goal of TCJA is to compute a fusion matrix $\mathcal{F}$ that integrates temporal and channel information. For a specific element of $\mathcal{F}$, say $\mathcal{F}_{3,2}$, the calculation pipeline is: 1) compute $\mathcal{T}_{3,2}$ through the TLA mechanism (Eq. \ref{equ_time_attention}); 2) compute $\mathcal{C}_{3,2}$ through the CLA mechanism (Eq. \ref{equ_channel_attention}), with the results shown in the black dotted box in the figure; 3) apply the CCF mechanism (Eq. \ref{equ_ccf}) to jointly learn temporal and channel information and obtain $\mathcal{F}_{3,2}$. After the CCF mechanism, $\mathcal{F}_{3,2}$ integrates the information of the elements in the cross-shaped receptive field (colored areas in $\mathcal{F}$) around it as the anchor point, which is what gives Cross Convolutional Fusion its name.
  • ...and 7 more figures
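
The comparison in Figure 3 can be illustrated with a back-of-the-envelope parameter count. The sketch below rests on assumptions that are not stated in this excerpt: the FC baseline is taken to be a single bias-free dense layer that maps the flattened $T \times C$ average matrix to itself, the TCJA layer is taken to be two bias-free 1-D convolutions (one with $C$ input and output channels, one with $T$ input and output channels), and the kernel size of 3 is a hypothetical default. The helper names `fc_params` and `tcja_params` are invented for this example.

```python
def fc_params(T, C):
    # Assumed baseline: one dense layer from T*C inputs to T*C outputs, no bias.
    return (T * C) ** 2


def tcja_params(T, C, k_t=3, k_c=3):
    # Assumed TCJA layer: Conv1d C->C with kernel k_t (TLA)
    # plus Conv1d T->T with kernel k_c (CLA), both without bias.
    return C * C * k_t + T * T * k_c


C = 64
for T in (4, 8, 16, 32):
    print(f"T={T:2d}  FC={fc_params(T, C):>9,d}  TCJA={tcja_params(T, C):>7,d}")
```

Under these assumptions the FC count grows quadratically in $T \cdot C$, while the TCJA count is dominated by the fixed $C^2 k$ term and grows only through the much smaller $T^2 k$ term, which matches the qualitative gap the figure caption describes.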