TDFormer: A Top-Down Attention-Controlled Spiking Transformer

Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin

TL;DR

TDFormer addresses temporal information bottlenecks in spiking neural networks by introducing a top-down feedback pathway that leverages high-order representations from earlier time steps to modulate later processing. The architecture employs a TDAC (control module and processing module) to realign attention with temporal context, yielding forward mutual information growth and enhanced gradient flow along the time dimension, supported by theoretical analysis. Empirically, it achieves state-of-the-art results across ImageNet and neuromorphic benchmarks (e.g., an ImageNet top-1 accuracy of 86.83%) with only modest energy overhead, bridging the performance gap between SNNs and ANNs. The work highlights a biologically inspired, energy-efficient route for Transformer-based SNNs and suggests avenues for extending the approach to other backbones and tasks.

Abstract

Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
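
The contrast between membrane-potential-only coupling and an explicit top-down link can be illustrated with a minimal sketch. This is not the paper's TDAC or attention mechanism: it assumes a toy two-layer network of leaky integrate-and-fire (LIF) neurons in which the previous step's high-level spikes are simply added to the low-level input current. The names `lif_step`, `run_two_layer_snn`, `w_low`, `w_high`, and `w_fb`, as well as the constants `tau` and `v_th`, are hypothetical choices made for this illustration.

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_th=1.0):
    """One LIF update with hard reset; returns (spikes, new membrane potential)."""
    v = v + (x - v) / tau                  # leaky integration toward the input current
    spikes = (v >= v_th).astype(x.dtype)   # fire where the threshold is reached
    v = v * (1.0 - spikes)                 # hard reset to 0 where a spike fired
    return spikes, v

def run_two_layer_snn(inputs, w_low, w_high, w_fb=None):
    """Run T time steps of a toy two-layer spiking network.

    inputs: array of shape (T, D_in).
    w_fb:   optional (D_high, D_low) feedback weights (an assumption of this
            sketch). When given, the high-level spikes from step t-1 are
            projected back and added to the low-level current at step t.
    """
    v_low = np.zeros(w_low.shape[1])
    v_high = np.zeros(w_high.shape[1])
    high_prev = np.zeros(w_high.shape[1])  # no feedback exists before t = 0
    outputs = []
    for x_t in inputs:
        current = x_t @ w_low
        if w_fb is not None:
            # Top-down path: earlier high-order output modulates later low-order input.
            current = current + high_prev @ w_fb
        low, v_low = lif_step(v_low, current)
        high, v_high = lif_step(v_high, low @ w_high)
        high_prev = high
        outputs.append(high)
    return np.stack(outputs)
```

With `w_fb=None` the membrane potentials `v_low` and `v_high` are the only state carried between time steps, which is the baseline situation the abstract describes. With feedback enabled, the first time step is unchanged because no high-level output exists yet; later steps receive an explicit signal from the preceding step.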

Paper Structure

This paper contains 34 sections, 6 theorems, 96 equations, 5 figures, 6 tables.

Key Result

Proposition 4.1

The upper bound $\overline{\text{Var}}(Y_{tnc})$ for $\mathbf{X} \odot \mathbf{M}_{\text{spatial}}$ is given as follows: where we assume each $\mathbf{X}_{t,n,c}$ is an independent random variable $X_{tnc} \sim \text{Bernoulli}(f)$, with $f$ as the firing rate.
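
As background for the Bernoulli assumption (and not the proposition's bound itself), the moments of a single spike element follow directly from $X_{tnc} \in \{0, 1\}$:

$$
\mathbb{E}[X_{tnc}] = f, \qquad \text{Var}(X_{tnc}) = \mathbb{E}[X_{tnc}^2] - \mathbb{E}[X_{tnc}]^2 = f - f^2 = f(1 - f) \le \tfrac{1}{4}.
$$

If, purely for illustration, the corresponding mask entry is treated as a fixed constant $m$ (a hypothetical simplification, not a statement about how $\mathbf{M}_{\text{spatial}}$ is defined in the paper), the masked element has variance

$$
\text{Var}(m \, X_{tnc}) = m^2 f(1 - f).
$$

For example, a firing rate of $f = 0.2$ gives $\text{Var}(X_{tnc}) = 0.16$, and the variance is largest at $f = 0.5$.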

Figures (5)

  • Figure 1: Visualization of mutual information matrices of features across time steps on ImageNet. The left panel shows the baseline model; the right panel shows the model incorporating feedback connections. A higher level of mutual information suggests that the model captures more consistent and temporally dependent features across time steps.
  • Figure 2: Overview of the TDFormer architecture. (a) Overall design inspired by top-down pathways in the brain, mimicking feedback from the prefrontal cortex to the visual cortex for temporal modulation in SNNs; (b) and (c) Detailed structures of the processing and control modules; (d) Information flow within the subnetwork, highlighting processing of feedback signals; (e) Four processing module variants, labeled v1–v4.
  • Figure 3: Histogram of the surrogate-function gradients for LIF neurons in the attention module within the processing module (PM). The clamp operation keeps the variance of the attention map from growing too large, which prevents the vanishing gradient problem.
  • Figure 4: Visualization of CIFAR-10C. This figure showcases 19 columns corresponding to 19 different types of corruptions. Each column contains four images: the top image displays the original CIFAR-10C image; the second image shows the visualization result of the baseline model; the third image illustrates the first feedforward stage of the TDFormer model; the fourth image depicts the second feedforward stage of the TDFormer model, demonstrating the model’s dynamic attention adjustments across stages.
  • Figure 5: Visualization of ImageNet-C. This figure showcases 19 columns corresponding to 19 different types of corruptions. The layout and visualization style are similar to those shown in Figure 4.
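
A time-step-by-time-step mutual information matrix of the kind described for Figure 1 could be computed along the following lines. The source does not state which estimator the authors use, so this is only a sketch under the assumption that features are binary spikes and that a simple plug-in (histogram) estimate is acceptable. The function names `binary_mutual_information` and `mi_matrix` are hypothetical.

```python
import numpy as np

def binary_mutual_information(a, b):
    """Plug-in mutual information estimate (in bits) between two binary arrays."""
    a = np.asarray(a).ravel().astype(int)
    b = np.asarray(b).ravel().astype(int)
    joint = np.zeros((2, 2))
    np.add.at(joint, (a, b), 1)             # count co-occurrences of (a_i, b_i)
    joint /= joint.sum()                    # empirical joint distribution
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a, shape (2, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b, shape (1, 2)
    nz = joint > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

def mi_matrix(spikes):
    """spikes: binary array of shape (T, N). Returns a T x T matrix of pairwise MI."""
    T = spikes.shape[0]
    return np.array([[binary_mutual_information(spikes[i], spikes[j])
                      for j in range(T)] for i in range(T)])
```

Under this estimator, the diagonal of the matrix equals the empirical entropy of each time step's features, and an off-diagonal entry is zero when the two time steps are empirically independent.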

Theorems & Definitions (12)

  • Proposition 4.1
  • Definition 4.2
  • Theorem 4.3
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof
  • ...and 2 more