SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization
Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan
TL;DR
SpecQuant addresses ultra-low-bit quantization for large language models by introducing a two-stage, frequency-domain framework that first absorbs activation outliers into the weight domain via smoothing and then performs activation-aware channel-wise low-frequency Fourier truncation. This spectral truncation preserves most signal energy while suppressing high-frequency noise, enabling robust 4-bit quantization with limited accuracy loss. The method achieves substantial practical gains, including ~2× speedups and ~3× memory savings on LLaMA models, with minimal zero-shot accuracy degradation (e.g., ~1.21% on LLaMA-3-8B). By leveraging channel-wise spectral decay and a data-driven budget allocation via spectral entropy, SpecQuant offers a principled, outlier-resilient path to deploy ultra-efficient LLMs on edge devices.
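The spectral-entropy budget allocation mentioned above can be illustrated with a small NumPy sketch. The summary does not give the exact formula, so everything here is an assumption made for illustration: the function names `spectral_entropy` and `allocate_budget`, the choice to take the FFT along the input dimension of each output channel, and the proportional-to-entropy allocation rule are all hypothetical and may differ from what SpecQuant actually does.

```python
import numpy as np

def spectral_entropy(W, eps=1e-12):
    """Normalized spectral entropy per channel (column) of W, in [0, 1].

    Assumption: a channel is a column of W and the spectrum is taken along axis 0.
    """
    power = np.abs(np.fft.rfft(W, axis=0)) ** 2           # power per frequency bin
    p = power / (power.sum(axis=0, keepdims=True) + eps)  # normalize to a distribution
    H = -(p * np.log(p + eps)).sum(axis=0)                # Shannon entropy per channel
    return np.clip(H / np.log(power.shape[0]), 0.0, 1.0)

def allocate_budget(W, total_bins, min_bins=1):
    """Split a total number of retained frequency bins across channels.

    Hypothetical rule: every channel gets `min_bins`, and the remainder is
    shared in proportion to spectral entropy (flatter spectra get more bins).
    """
    n_bins = W.shape[0] // 2 + 1
    n_ch = W.shape[1]
    H = spectral_entropy(W)
    share = H / H.sum() if H.sum() > 0 else np.full(n_ch, 1.0 / n_ch)
    extra = max(total_bins - min_bins * n_ch, 0)
    keep = min_bins + np.floor(share * extra).astype(int)
    return np.clip(keep, min_bins, n_bins)

# Example: one channel is a pure cosine (energy in a single bin),
# the other is white noise (energy spread over all bins).
n = 64
t = np.arange(n)
rng = np.random.default_rng(0)
W_demo = np.stack([np.cos(2 * np.pi * 3 * t / n), rng.standard_normal(n)], axis=1)
print(spectral_entropy(W_demo))
print(allocate_budget(W_demo, total_bins=20))
```

Under this rule, a channel whose energy sits in a few bins receives a small budget, while a channel with a flat spectrum receives most of it.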
Abstract
The emergence of accurate open large language models (LLMs) has sparked a push for advanced quantization techniques to enable efficient deployment on end-user devices. In this paper, we revisit the challenge of extreme LLM compression, targeting ultra-low-bit quantization for both activations and weights, from a Fourier frequency-domain perspective. We propose SpecQuant, a two-stage framework that tackles activation outliers and cross-channel variance. In the first stage, activation outliers are smoothed and transferred into the weight matrix to simplify downstream quantization. In the second stage, we apply channel-wise low-frequency Fourier truncation to suppress high-frequency components while preserving essential signal energy, improving quantization robustness. Our method builds on the principle that most of the weight energy is concentrated in low-frequency components, which can be retained with minimal impact on model accuracy. To enable runtime adaptability, we introduce a lightweight truncation module during inference that adjusts truncation thresholds based on channel characteristics. On LLaMA-3 8B, SpecQuant achieves 4-bit quantization for both weights and activations, narrowing the zero-shot accuracy gap to only 1.5% compared to full precision, while delivering 2 times faster inference and 3 times lower memory usage.
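The two stages described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothing step uses a SmoothQuant-style per-channel scale with a hypothetical exponent `alpha`, the FFT is taken along the input dimension of each output channel, and a fixed `keep` count stands in for the adaptive truncation threshold. The function names `smooth` and `lowfreq_truncate` are invented for this example.

```python
import numpy as np

def smooth(X, W, alpha=0.5, eps=1e-8):
    """Stage 1 (assumed SmoothQuant-style): move activation outliers into W.

    X: (tokens, in_features), W: (in_features, out_features).
    Returns X / s and diag(s) @ W, whose product equals X @ W.
    """
    act_max = np.maximum(np.abs(X).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(W).max(axis=1), eps)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s[:, None], s

def lowfreq_truncate(W, keep, eps=1e-12):
    """Stage 2 (sketch): keep the `keep` lowest rFFT bins of each column of W.

    Returns the reconstructed weights and the fraction of each column's
    energy that survives the truncation.
    """
    n = W.shape[0]
    spec = np.fft.rfft(W, axis=0)
    spec[keep:] = 0                                   # drop high-frequency bins
    W_low = np.fft.irfft(spec, n=n, axis=0)
    retained = (W_low ** 2).sum(axis=0) / ((W ** 2).sum(axis=0) + eps)
    return W_low, retained

# Example with synthetic data: channel 5 of X carries large outliers.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 64))
X[:, 5] *= 50.0
W = rng.standard_normal((64, 16))

Xs, Ws, s = smooth(X, W)
print(np.abs(X).max(), np.abs(Xs).max())   # activation range before and after
W_low, retained = lowfreq_truncate(Ws, keep=16)
print(retained.round(3))                   # per-channel retained energy
```

Random Gaussian weights have a flat spectrum, so the retained energy printed here is only a mechanical check; the abstract's claim is that trained LLM weights concentrate most of their energy in the low-frequency bins, which is what makes the truncation cheap in accuracy.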
