TDFormer: A Top-Down Attention-Controlled Spiking Transformer
Zizheng Zhu, Yingchao Yu, Zeqi Zheng, Zhaofei Yu, Yaochu Jin
TL;DR
TDFormer addresses the temporal information bottleneck in spiking neural networks (SNNs) by introducing a top-down feedback pathway that uses high-order representations from earlier time steps to modulate processing at later ones. The architecture employs a TDAC, consisting of a control module and a processing module, to realign attention with temporal context. This increases mutual information across time steps in the forward pass and improves gradient flow along the time dimension in the backward pass, supported by theoretical analysis. Empirically, it achieves state-of-the-art results on ImageNet and neuromorphic benchmarks (e.g., an ImageNet top-1 accuracy of 86.83%) with only modest energy overhead, narrowing the performance gap between SNNs and ANNs. The work highlights a biologically inspired, energy-efficient route for Transformer-based SNNs and suggests avenues for extending the approach to other backbones and tasks.
Abstract
Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks, each running for one time step, in which the parameters are shared and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to represent temporal information effectively. As a result, each time step cannot fully leverage information from previous time steps, which seriously limits the model's performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure contributes in two ways: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated across time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
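
The contrast drawn in the abstract, between a membrane potential that is the only link across time steps and an explicit top-down path from earlier high-order outputs, can be illustrated with a toy sketch. This is not the TDFormer architecture or its TDAC module: the two-neuron network, the multiplicative modulation, and the names `lif_step`, `run`, and `feedback_gain` are all assumptions introduced here for illustration only.

```python
def lif_step(v, x, tau=2.0, v_th=1.0):
    """One leaky integrate-and-fire update; returns (new_potential, spike)."""
    v = v + (x - v) / tau            # leaky integration toward the input
    spike = 1.0 if v >= v_th else 0.0
    v = v * (1.0 - spike)            # hard reset to 0 after a spike
    return v, spike


def run(inputs, feedback_gain=0.0):
    """Unroll a toy two-neuron network (low-level -> high-level) over time.

    With feedback_gain == 0, the membrane potentials are the only state
    carried between time steps. With feedback_gain > 0, the high-level spike
    from the previous step also rescales the low-level input at the current
    step (a hypothetical stand-in for top-down modulation).
    """
    v_low, v_high = 0.0, 0.0
    prev_high = 0.0                  # high-order output of the previous step
    high_spikes = []
    for x in inputs:
        # Assumed form of top-down modulation: multiplicative input gain.
        x_mod = x * (1.0 + feedback_gain * prev_high)
        v_low, s_low = lif_step(v_low, x_mod)
        v_high, s_high = lif_step(v_high, 2.0 * s_low)
        prev_high = s_high
        high_spikes.append(s_high)
    return high_spikes


if __name__ == "__main__":
    constant_input = [1.5] * 6
    print(run(constant_input))                      # membrane potential only
    print(run(constant_input, feedback_gain=1.0))   # with top-down feedback
```

In this toy setting, the same weights are reused at every step, and the only difference between the two calls is whether the previous step's high-level output is fed back into the current step's low-level input. How TDFormer actually computes and applies its feedback signal is not specified in this section.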
