SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

TL;DR

SpecQuant addresses ultra-low-bit quantization for large language models by introducing a two-stage, frequency-domain framework that first absorbs activation outliers into the weight domain via smoothing and then performs activation-aware channel-wise low-frequency Fourier truncation. This spectral truncation preserves most signal energy while suppressing high-frequency noise, enabling robust 4-bit quantization with limited accuracy loss. The method achieves substantial practical gains, including ~2× speedups and ~3× memory savings on LLaMA models, with minimal zero-shot accuracy degradation (e.g., ~1.21% on LLaMA-3-8B). By leveraging channel-wise spectral decay and a data-driven budget allocation via spectral entropy, SpecQuant offers a principled, outlier-resilient path to deploy ultra-efficient LLMs on edge devices.
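The summary above does not spell out how spectral entropy drives the budget allocation, so the following is only a minimal sketch of one plausible reading. It computes the Shannon entropy of each channel's normalized power spectrum and splits a coefficient budget in proportion to it. The function names (`spectral_entropy`, `allocate_budget`) and the proportional rule are assumptions for illustration, not the paper's procedure.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def spectral_entropy(channel):
    """Shannon entropy (in nats) of the normalized power spectrum of one channel."""
    power = [abs(c) ** 2 for c in dft(channel)]
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_budget(channels, total_coeffs):
    """Hypothetical rule: share a coefficient budget in proportion to entropy.

    Every channel keeps at least one coefficient, so the returned budgets
    may sum to slightly more than `total_coeffs`.
    """
    entropies = [spectral_entropy(ch) for ch in channels]
    norm = sum(entropies) or 1.0
    return [max(1, round(total_coeffs * e / norm)) for e in entropies]
```

Under this sketch, a channel whose energy sits in a single frequency has entropy near zero and receives the minimum budget, while a channel with a flat spectrum has the maximum entropy, `log(n)`, and receives the largest share.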

Abstract

The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression, targeting ultra-low-bit quantization for both activations and weights, from a Fourier frequency-domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2× faster inference and 3× lower memory usage.
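The two stages described in the abstract can be illustrated with a small, dependency-free sketch. Stage one is shown here as SmoothQuant-style per-channel scaling, which is an assumption, since the abstract only says outliers are "smoothed and transferred into the weight matrix". Stage two keeps the lowest-frequency Fourier coefficients of a single channel and reports the fraction of energy retained. The names `smooth`, `lowpass_truncate`, `alpha`, and `keep` are hypothetical.

```python
import cmath
import math


def smooth(x_rows, w_rows, alpha=0.5):
    """Assumed SmoothQuant-style migration of activation outliers into weights.

    x_rows: activations, shape [tokens][in_channels]
    w_rows: weights, shape [in_channels][out_channels]
    Channel j of X is divided by s_j and row j of W is multiplied by s_j,
    so the product X @ W is unchanged.
    """
    n_in = len(w_rows)
    scales = []
    for j in range(n_in):
        x_max = max(abs(row[j]) for row in x_rows)
        w_max = max(abs(v) for v in w_rows[j])
        if x_max > 0 and w_max > 0:
            scales.append(x_max ** alpha / w_max ** (1 - alpha))
        else:
            scales.append(1.0)  # leave degenerate channels untouched
    x_s = [[row[j] / scales[j] for j in range(n_in)] for row in x_rows]
    w_s = [[v * scales[j] for v in w_rows[j]] for j in range(n_in)]
    return x_s, w_s, scales


def dft(signal):
    """Naive O(n^2) forward discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def lowpass_truncate(channel, keep):
    """Keep the `keep` lowest frequencies of one real channel, zero the rest.

    Bin k and its mirror n - k are kept together so the reconstruction
    stays real. Returns (approximation, fraction of energy retained).
    """
    n = len(channel)
    spec = dft(channel)
    kept = [c if min(k, n - k) < keep else 0j for k, c in enumerate(spec)]
    total = sum(abs(c) ** 2 for c in spec)
    retained = sum(abs(c) ** 2 for c in kept) / total if total else 1.0
    approx = [v.real for v in idft(kept)]
    return approx, retained
```

For a channel dominated by slow variation, `lowpass_truncate` with a small `keep` recovers the signal almost exactly while discarding most coefficients. That is the property the abstract relies on when it states that most weight energy is concentrated in low-frequency components.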

Paper Structure

This paper contains 19 sections, 25 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Activation and weight distributions before and after naive smoothing. While smoothing aims to mitigate activation outliers, it often transfers the quantization burden to weights, introducing new outliers and degrading the robustness of both activations and weights under quantization.
  • Figure 2: Overview of the proposed SpecQuant.
  • Figure 3: Comparison between conventional quantization and SpecQuant. Outlier channels in input activations are marked in red. SpecQuant adaptively absorbs these outliers using frequency-domain approximation, reducing overall quantization error.
  • Figure 4: Comparison of weights decomposed by SpecQuant and SVDQuant.
  • Figure 5: Distribution comparison of the original weight magnitudes and the approximated weights by SpecQuant and SVD at different compression ratios for the first fully connected layer of the LLaMA-2 7B model.
  • ...and 1 more figure