Kirin: Improving ANN efficiency with SNN Hybridization

Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong

TL;DR

Kirin tackles the energy cost of large language model inference by converting pre-trained ANNs into SNNs without accuracy loss, using an integer-spike hybrid approach. The method combines Spike Matrix Hybridization, which keeps high bit-width outliers as integers and encodes the remaining values as time-to-first-spike (TTFS) spikes, with a Silence Threshold TTFS strategy that preserves exact ANN outputs. It achieves near-FP16 accuracy under W4A4&8 quantization on Llama2-7B and OPT-2.7B while reducing energy consumption by up to 84.66% and time steps by up to 93.75%, with attention operations benefiting in particular. This work offers a practical, scalable pathway to energy-efficient, accurate SNN-based inference in large-scale transformers.

Abstract

Artificial neural networks (ANNs), particularly large language models (LLMs), demonstrate powerful inference capabilities but consume substantial energy. Conversely, spiking neural networks (SNNs) exhibit exceptional energy efficiency due to their binary and event-driven characteristics, which motivates the study of ANN-to-SNN conversion. In this process, quantization plays a pivotal role, mapping LLMs' floating-point parameters to discrete SNN parameters via the temporal dimension of the time window. However, several challenges remain in the conversion process: (i) converting high bit-width quantized values into binary spikes requires longer time windows, increasing system latency; and (ii) SNNs face an inherent trade-off between the information loss of single-spike schemes and the energy cost of multi-spike ones. To address these challenges, we propose Kirin, an integer-and-spike hybrid SNN that achieves lossless ANN-to-SNN conversion with time and energy efficiency. Specifically, we first propose a Spike Matrix Hybridization strategy that encodes low bit-width parameters, which need only a small time window, into binary spikes while preserving the rest in integer format, thereby reducing the overall latency of SNN execution. Second, we introduce a silence threshold mechanism to regulate the timing of single-spike firing, ensuring that the output is mathematically equivalent to the LLM's output and that accuracy is preserved. Experimental results demonstrate that Kirin, under a W4A4&8 quantization setting, achieves near-FP16 accuracy while reducing energy consumption by up to 84.66% and shortening time steps by 93.75%.
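
The hybridization idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `window_length` and `split_hybrid`, the sample activations, and the assumption that a b-bit value needs a time window of 2^b steps are all hypothetical choices made for illustration. Under that assumption, moving the spike path from 8-bit to 4-bit values shrinks the window from 256 to 16 steps, which happens to equal a 93.75% reduction; the paper's own accounting may differ.

```python
def window_length(bits: int) -> int:
    """Time steps assumed to be needed for a `bits`-wide unsigned value.

    Assumption for illustration: the window grows as 2**bits.
    """
    return 2 ** bits


def split_hybrid(values, low_bits=4):
    """Split quantized values into a spike part and an integer part.

    Values that fit in `low_bits` bits go to the spike path; larger
    values (outliers) stay on the integer path. Positions are kept by
    zero-filling, so the two parts add back up to the input.
    """
    limit = 2 ** low_bits - 1
    spike_part = [v if v <= limit else 0 for v in values]
    integer_part = [v if v > limit else 0 for v in values]
    return spike_part, integer_part


# Hypothetical 8-bit quantized activations with two outliers.
activations = [3, 0, 15, 200, 7, 129]

spike_part, integer_part = split_hybrid(activations, low_bits=4)
recombined = [s + i for s, i in zip(spike_part, integer_part)]

# Relative reduction in window length when the spike path is 4-bit
# instead of 8-bit, under the 2**bits assumption above.
reduction = 1 - window_length(4) / window_length(8)

print(spike_part)           # [3, 0, 15, 0, 7, 0]
print(integer_part)         # [0, 0, 0, 200, 0, 129]
print(f"{reduction:.2%}")   # 93.75%
```

Because the two parts are disjoint and zero-filled, a matrix product over the original values can in principle be computed as the sum of a spike-path product and an integer-path product, which is why the split itself loses no information.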

Paper Structure

This paper contains 19 sections, 30 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: From Floating-Point Values to Spike Encoding and Integrate-and-Fire mechanism. Ten floating-point values are randomly sampled and quantized (scale = 1.42, zero-point = 0); three examples are shown with their binary codes and corresponding spike trains.
  • Figure 2: Overview of the challenges and proposed ANN-to-SNN lossless conversion framework. Top: The two main bottlenecks in current SNN conversion—latency due to activation outliers and data loss in IF neurons. Bottom: The proposed solution utilizing Spike Matrix Hybridization to handle long latency and a Silence Threshold mechanism to ensure lossless outputs.
  • Figure 3: Energy consumption analysis across different model architectures.
  • Figure 4: Comparison of energy consumption. Top row: OPT models; Bottom row: Llama models.
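
The pipeline described in the Figure 1 caption (floating-point value, quantized integer, binary code, spike train) can be sketched in Python as follows. Only the scale of 1.42 and the zero-point of 0 come from the caption; the 4-bit width, the sample input, the helper names, and both encoding conventions (a multi-spike train carrying one spike per unit, and a single-spike TTFS train in which larger values fire earlier) are assumptions for illustration and may not match the paper's exact scheme.

```python
def quantize(x: float, scale: float = 1.42, zero_point: int = 0, bits: int = 4) -> int:
    """Uniform quantization to an unsigned `bits`-wide integer."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def multi_spike_train(q: int, window: int) -> list:
    """Multi-spike encoding: emit q spikes at the start of the window."""
    return [1 if t < q else 0 for t in range(window)]


def ttfs_spike_train(q: int, window: int) -> list:
    """Single-spike encoding: larger values fire earlier; zero stays silent."""
    train = [0] * window
    if q > 0:
        train[window - q] = 1
    return train


def decode_multi_spike(train: list) -> int:
    """Integrate the spikes: the count recovers the integer."""
    return sum(train)


def decode_ttfs(train: list) -> int:
    """Recover the integer from the time of the first (and only) spike."""
    return len(train) - train.index(1) if 1 in train else 0


window = 2 ** 4              # assumed window for 4-bit values
q = quantize(5.0)            # 5.0 / 1.42 is about 3.52, which rounds to 4
code = format(q, "04b")      # binary code of the quantized value

multi = multi_spike_train(q, window)
ttfs = ttfs_spike_train(q, window)

print(q, code)                           # 4 0100
print(sum(multi), sum(ttfs))             # 4 1  (spikes emitted per scheme)
print(decode_multi_spike(multi), decode_ttfs(ttfs))  # 4 4
```

Both trains decode to the same integer, but the multi-spike train emits four spikes where the TTFS train emits one, which is the energy side of the single-spike versus multi-spike trade-off mentioned in the abstract.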