AT-SNN: Adaptive Tokens for Vision Transformer on Spiking Neural Network
Donghwa Kang, Youngmoon Lee, Eun-Kyu Lee, Brent Kang, Jinkyu Lee, Hyeongboo Baek
TL;DR
This work addresses the high energy cost of SNN-based vision transformers by extending adaptive computation time (ACT) to a two-dimensional halting framework across timesteps and encoder blocks, together with a token-merge mechanism to reduce token counts. It introduces AT-SNN, formulates token-level halting scores $h^{l,t}_{k}$ and accumulators $H_k(L',T')$, and optimizes a combined loss $\mathcal{L}_{overall}$ to encourage early, accurate halting. Implemented on Spikformer, AT-SNN achieves higher accuracy with fewer tokens and lower energy consumption than state-of-the-art CNN- and transformer-based SNN methods across CIFAR-10, CIFAR-100, and TinyImageNet, while providing interpretable token-processing heatmaps. The approach advances practical deployment of energy-efficient SNN ViTs by balancing computation across temporal and spatial dimensions and highlighting the benefit of temporally-aware token processing.
Abstract
In the training and inference of spiking neural networks (SNNs), direct training and lightweight computation methods have been developed orthogonally, both aimed at reducing power consumption. However, only a limited number of approaches have applied the two mechanisms simultaneously, and those approaches fail to fully leverage the advantages of SNN-based vision transformers (ViTs) because they were originally designed for convolutional neural networks (CNNs). In this paper, we propose AT-SNN, which dynamically adjusts the number of tokens processed during inference in directly trained SNN-based ViTs, where power consumption is proportional to the number of tokens. We first demonstrate that adaptive computation time (ACT), previously limited to RNNs and ViTs, is applicable to SNN-based ViTs, and enhance it to selectively discard less informative spatial tokens. We also propose a new token-merge mechanism that relies on token similarity, which further reduces the number of tokens while enhancing accuracy. We implement AT-SNN on Spikformer and show that it achieves high energy efficiency and accuracy compared to state-of-the-art approaches on the image classification tasks CIFAR-10, CIFAR-100, and TinyImageNet. For example, our approach uses up to 42.4% fewer tokens than the existing best-performing method on CIFAR-100 while achieving higher accuracy.
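To make the two-dimensional halting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each token $k$ accumulates its halting scores $h^{l,t}_{k}$ over encoder blocks $l$ and timesteps $t$, and that a token stops being processed once its accumulator reaches $1 - \epsilon$. The function name `count_active_tokens`, the threshold `eps`, the iteration order (blocks within timesteps), and the toy scores are all hypothetical choices made for illustration.

```python
import numpy as np

def count_active_tokens(h, eps=0.01):
    """Count tokens still processed at each (timestep, block) pair.

    h: array of shape (T, L, K) holding halting scores in [0, 1]
       for T timesteps, L encoder blocks, and K tokens (assumed layout).
    """
    T, L, K = h.shape
    acc = np.zeros(K)                   # per-token accumulated halting score
    active = np.ones(K, dtype=bool)     # tokens not yet halted
    processed = np.zeros((T, L), dtype=int)
    for t in range(T):
        for l in range(L):
            processed[t, l] = int(active.sum())
            # Only tokens that are still active accumulate further score.
            acc[active] += h[t, l, active]
            # Assumed rule: halt a token once its accumulator reaches 1 - eps.
            active &= acc < 1.0 - eps
    return processed

# Toy example: 2 timesteps, 2 blocks, 3 tokens with constant scores.
h = np.zeros((2, 2, 3))
h[:, :, 0] = 0.6   # reaches the threshold after two blocks
h[:, :, 1] = 0.3   # still below the threshold after three blocks
h[:, :, 2] = 0.1   # never reaches the threshold
processed = count_active_tokens(h)
print(processed)
```

Under these assumptions, the first token is dropped after the first timestep, so fewer tokens are processed in the second timestep. Since the abstract states that power consumption is proportional to the number of tokens, the sum of `processed` serves as a rough proxy for compute in this toy setting.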
