FAS: Fast ANN-SNN Conversion for Spiking Large Language Models

Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun

TL;DR

FAS tackles the energy bottleneck of large language models by converting them to spiking LLMs in two stages. Stage 1 performs full-parameter fine-tuning with activations replaced by QCFS, which eliminates quantization and clipping errors. Stage 2 applies layer-wise and neuron-wise coarse-to-fine calibration to reduce temporal errors, guided by activation-align and logits losses. Across NLU, NLG, and vision-language tasks, FAS achieves state-of-the-art performance with far fewer timesteps and much lower energy consumption; with eight timesteps, its accuracy is comparable to or better than that of the ANN baselines. The method supports spiking Softmax and LayerNorm via UGO and performs robustly across multiple LLM scales and modalities, as shown by ablations and comparisons with existing SOTA ANN-SNN methods. Overall, FAS offers a practical, scalable path to high-performance, energy-efficient spiking LLMs suitable for deployment on neuromorphic hardware.
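To illustrate why a QCFS activation removes quantization and clipping errors, here is a minimal sketch of the general idea behind QCFS in ANN-SNN conversion; it is not the FAS code. It assumes the standard quantization-clip-floor-shift form, threshold * clip(floor(x * L / threshold + 0.5) / L, 0, 1), and a soft-reset integrate-and-fire neuron whose membrane potential starts at half the threshold. The function names `qcfs` and `if_neuron_rate`, and the values chosen for the threshold, the number of levels, and the inputs, are hypothetical.

```python
import math


def qcfs(x, threshold=1.0, levels=4):
    """QCFS activation (assumed form): quantize to `levels` steps, clip to [0, threshold]."""
    q = math.floor(x * levels / threshold + 0.5) / levels
    return threshold * min(max(q, 0.0), 1.0)


def if_neuron_rate(x, threshold=1.0, timesteps=4):
    """Average output of a soft-reset integrate-and-fire neuron under constant input x."""
    v = threshold / 2.0  # initial membrane potential: half the threshold (assumption)
    spikes = 0
    for _ in range(timesteps):
        v += x  # integrate the constant input current
        if v >= threshold:
            spikes += 1
            v -= threshold  # soft reset: subtract the threshold, keep the residual
    return threshold * spikes / timesteps


# Illustrative inputs, chosen away from quantization boundaries.
inputs = [-0.5, 0.0, 0.1, 0.3, 0.6, 0.9, 1.5]
ann_outputs = [qcfs(x) for x in inputs]
snn_outputs = [if_neuron_rate(x) for x in inputs]
print(ann_outputs)
print(snn_outputs)
```

For these inputs the two lists agree when the number of timesteps equals the number of quantization levels, which is the sense in which the activation replacement absorbs quantization and clipping errors. In a real network the input to a spiking layer varies over time rather than staying constant, and the resulting mismatch is the temporal error that the Stage 2 calibration targets.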

Abstract

Spiking Large Language Models have been shown to be a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance with significantly reduced inference latency and computational costs. Notably, FAS takes only eight timesteps to achieve an accuracy 3% higher than that of the OPT-7B model, while reducing energy consumption by 96.63%. The source code is available at https://github.com/lc783/FAS

Paper Structure

This paper contains 46 sections, 18 equations, 11 figures, 21 tables, 1 algorithm.

Figures (11)

  • Figure 1: Performance of ANN-SNN conversion methods on GPT-2 for WikiText-103.
  • Figure 2: The overall framework of the proposed FAS method. The QC error is composed of the quantization error and the clipping error.
  • Figure 3: Illustration of our observations.
  • Figure 5: The effectiveness of FAS for threshold and initial membrane potentials optimization.
  • Figure 6: Relationship between errors and cosine similarity.
  • ...and 6 more figures