One-Timestep is Enough: Achieving High-performance ANN-to-SNN Conversion via Scale-and-Fire Neurons

Qiuyang Chen, Huiqi Yang, Qingyan Meng, Zhengyu Ma

TL;DR

This work targets the latency and energy cost of ANN-to-SNN conversion by proposing a single-timestep framework. It establishes the Temporal-to-Spatial Equivalence Theory, which shows that the behavior of multi-timestep IF neurons can be replicated by single-timestep multi-threshold neurons, and introduces Scale-and-Fire Neurons (SFN) to enable efficient $T=1$ inference. The authors instantiate SFN in Transformers as SFormer, using threshold adaptation for SoftMax and Bayesian optimization to tune the scaling factor, and report state-of-the-art results such as 88.8% top-1 accuracy on ImageNet-1K at $T=1$ with substantial energy savings. The approach pairs theoretical guarantees with practical design choices and performs strongly across vision tasks, which suggests potential for energy-efficient, low-latency SNNs and for future extensions to larger models, including language models.

Abstract

Spiking Neural Networks (SNNs) are gaining attention as energy-efficient alternatives to Artificial Neural Networks (ANNs), especially in resource-constrained settings. While ANN-to-SNN conversion (ANN2SNN) achieves high accuracy without end-to-end SNN training, existing methods rely on large time steps, leading to high inference latency and computational cost. In this paper, we propose a theoretical and practical framework for single-timestep ANN2SNN. We establish the Temporal-to-Spatial Equivalence Theory, proving that multi-timestep integrate-and-fire (IF) neurons can be equivalently replaced by single-timestep multi-threshold neurons (MTN). Based on this theory, we introduce the Scale-and-Fire Neuron (SFN), which enables effective single-timestep ($T=1$) spiking through adaptive scaling and firing. Furthermore, we develop the SFN-based Spiking Transformer (SFormer), a specialized instantiation of SFN within Transformer architectures, where spike patterns are aligned with attention distributions to mitigate the computational, energy, and hardware overhead of the multi-threshold design. Extensive experiments on image classification, object detection, and instance segmentation demonstrate that our method achieves state-of-the-art performance under single-timestep inference. Notably, we achieve 88.8% top-1 accuracy on ImageNet-1K at $T=1$, surpassing existing conversion methods.
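
The equivalence claimed above can be illustrated with a toy check. The sketch below is not the paper's formulation: it assumes a soft-reset (reset-by-subtraction) IF neuron with zero initial membrane potential and a constant input `x` at every timestep, and it compares that neuron against a hypothetical multi-threshold neuron that sees the accumulated input `T * x` once and counts how many of the thresholds $\theta, 2\theta, \dots, T\theta$ it reaches. The function names `if_spike_count` and `mtn_spike_count` are made up for illustration, and exact fractions are used to avoid floating-point ties at the threshold.

```python
from fractions import Fraction


def if_spike_count(x, theta, T):
    """Spikes emitted by a soft-reset IF neuron over T timesteps.

    Assumes constant input x per timestep and zero initial potential.
    """
    v = 0
    spikes = 0
    for _ in range(T):
        v += x                 # integrate the input
        if v >= theta:         # at most one spike per timestep
            spikes += 1
            v -= theta         # reset by subtraction
    return spikes


def mtn_spike_count(x, theta, T):
    """Spikes emitted in ONE timestep by a multi-threshold neuron.

    Assumes the neuron receives the accumulated input T * x and uses
    the T thresholds theta, 2*theta, ..., T*theta (illustrative choice).
    """
    total = T * x
    return sum(1 for k in range(1, T + 1) if total >= k * theta)


if __name__ == "__main__":
    theta = Fraction(1)
    for T in (1, 2, 4, 8):
        for k in range(-4, 21):
            x = Fraction(k, 8)
            assert if_spike_count(x, theta, T) == mtn_spike_count(x, theta, T)
    print("spike counts match for all tested inputs")
```

Under these assumptions both neurons emit $\min(T, \lfloor Tx/\theta \rfloor)$ spikes for $x \ge 0$ and none for $x < 0$, so the temporal accumulation of the IF neuron is reproduced by a spatial comparison against $T$ thresholds. Time-varying inputs and the scaling step of SFN are outside what this sketch covers.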

Paper Structure

This paper contains 42 sections, 44 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison between conventional multi-timestep SNNs and our single-timestep SNNs. (a) Traditional SNNs require multiple timesteps ($T>1$) to accumulate information over time for accurate representation. (b) IF neurons emit either one spike or none, based on a single threshold $\theta$. (c) SFNs use multiple thresholds to emit multiple spikes in a single timestep. (d) Our method replaces temporal accumulation with spatial threshold modulation, enabling high-accuracy inference within a single timestep ($T=1$).
  • Figure 2: The framework of our ANN2SNN method. (a) The standard ANN performs a forward pass while collecting activation values from each layer for parameter estimation. (b) The output curve of the SFN. (c) The SFN enables single-timestep inference by adopting parameters that are estimated from the ANN activation distribution and further refined via Bayesian optimization. (d) Bayesian optimization efficiently searches for the optimal scaling factor $\lambda$; the objective function evaluates the performance of the SNN configured with $\lambda$ on a validation dataset.
  • Figure 3: Accuracy and energy ratio under different scaling factors $\lambda$ on ImageNet-1K (ViT-Base). The optimal $\lambda=0.252$ is obtained via Bayesian optimization, balancing accuracy and energy ratio.
  • Figure 4: Accuracy under different $T$ on CIFAR-10 using ResNet-18. The optimized Multi-Threshold Neuron with maximum $N=T$ achieves comparable or even superior performance to the IF Neuron.
  • Figure 6: Activation and spike output distributions on ImageNet-1K using ViT-Base. (a)–(c): Distributions of activations from three stages in the ViT backbone — pre-QKV, post-GELU, and post-SoftMax, respectively; (d): Output spike distributions generated by the proposed SFN under different scaling factors $\lambda$, where the input is identical to (a). When $\lambda=0.252$, the output distribution closely aligns with the original pre-QKV activation distribution, confirming the effectiveness of the selected scaling in preserving information structure during conversion.