Kirin: Improving ANN efficiency with SNN Hybridization
Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong
TL;DR
Kirin tackles the energy cost of large language model inference by converting pre-trained ANNs into SNNs without accuracy loss, using an integer-spike hybrid approach. The method combines Spike Matrix Hybridization, which keeps high bit-width outliers as integers and encodes the remaining values as time-to-first-spike (TTFS) spikes, with a Silence Threshold TTFS strategy that preserves exact ANN outputs. It achieves near-FP16 accuracy under W4A4&8 quantization on Llama2-7B and OPT-2.7B while delivering up to 84.66% energy savings and up to a 93.75% reduction in time steps, with attention operations benefiting in particular. This work offers a practical, scalable pathway toward energy-efficient, accurate SNN-based inference in large-scale transformers.
Abstract
Artificial neural networks (ANNs), particularly large language models (LLMs), demonstrate powerful inference capabilities but consume substantial energy. Conversely, spiking neural networks (SNNs) exhibit exceptional energy efficiency due to their binary and event-driven characteristics, which motivates the study of ANN-to-SNN conversion. In this process, quantization plays a pivotal role, mapping LLMs' floating-point parameters to discrete SNN parameters via the temporal dimension of the time window. However, several challenges remain in the conversion process: (i) converting high bit-width quantized values into binary spikes requires longer time windows, increasing system latency; and (ii) SNNs face an inherent trade-off between the information loss of single-spike schemes and the energy cost of multi-spike ones. To address these challenges, we propose Kirin, an integer-spike hybrid SNN that achieves accuracy-lossless ANN-to-SNN conversion with time and energy efficiency. Specifically, we first propose a Spike Matrix Hybridization strategy that encodes low bit-width parameters, which require only a small time window, into binary spikes while preserving the rest in integer format, thereby reducing the overall latency of SNN execution. Second, we introduce a silence threshold mechanism to regulate the timing of single-spike firing, ensuring that the output is mathematically equivalent to the LLM's output and that accuracy is preserved. Experimental results demonstrate that Kirin, under a W4A4&8 quantization setting, achieves near-FP16 accuracy while reducing energy consumption by up to 84.66% and shortening time steps by 93.75%.
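To make the hybrid idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes unsigned integer activations, a time window of 2^b steps for b-bit values, and a simple convention in which zero emits no spike. The function names (`window_size`, `hybrid_split`, `ttfs_encode`, `ttfs_decode`) are hypothetical, and the zero-stays-silent rule is only a stand-in, not the paper's silence threshold mechanism. Under the 2^b assumption, moving from an 8-bit window to a 4-bit window cuts the step count by 1 - 16/256 = 93.75%, which is consistent with the reported reduction in time steps.

```python
import numpy as np

def window_size(bits: int) -> int:
    """Assumed number of time steps for a single spike to encode a b-bit value."""
    return 2 ** bits

def hybrid_split(x: np.ndarray, low_bits: int = 4):
    """Split unsigned integers into a spike-encoded part and an integer part.

    Values that fit in `low_bits` go to the spike path; the rest (outliers)
    stay as plain integers. The two parts sum back to `x`.
    """
    is_spike = x < 2 ** low_bits
    spike_part = np.where(is_spike, x, 0)
    int_part = np.where(is_spike, 0, x)
    return spike_part, int_part, is_spike

def ttfs_encode(values: np.ndarray, window: int) -> np.ndarray:
    """Time-to-first-spike: at most one spike per value, larger values fire earlier.

    Zero emits no spike at all (a simplifying assumption of this sketch).
    """
    values = np.asarray(values).ravel()
    train = np.zeros((window, values.size), dtype=np.uint8)
    for j, v in enumerate(values):
        if v > 0:
            train[window - 1 - v, j] = 1
    return train

def ttfs_decode(train: np.ndarray) -> np.ndarray:
    """Recover each value from the time of its single spike (0 if silent)."""
    window = train.shape[0]
    fired = train.any(axis=0)
    first_spike = train.argmax(axis=0)
    return np.where(fired, window - 1 - first_spike, 0)

# Illustrative data: two values (200 and 16) need more than 4 bits.
x = np.array([3, 200, 0, 15, 16, 7])
spike_part, int_part, is_spike = hybrid_split(x, low_bits=4)
train = ttfs_encode(spike_part, window_size(4))
reconstructed = ttfs_decode(train) + int_part

step_reduction = 1 - window_size(4) / window_size(8)  # 0.9375
print(reconstructed, step_reduction)
```

In this sketch the spike path needs only 16 time steps because the two outliers bypass it as integers, and the round trip reproduces `x` exactly.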
