ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
TL;DR
ButterflyQuant addresses the memory bottleneck of deploying LLMs at extreme quantization by introducing learnable butterfly transforms that maintain orthogonality and adapt to layer-specific outlier patterns. Unlike fixed Hadamard rotations, the continuous Givens-rotation parameterization enables gradient-based learning with $O(n \log n)$ parameters, and non-power-of-2 dimensions are handled via Kronecker-based composites. The method achieves state-of-the-art 2-bit performance among rotation-based PTQ approaches, with substantial improvements in perplexity and task accuracy on LLaMA-2 models, while requiring only 128 calibration samples and minutes of training on a single GPU. This practical, layer-adaptive approach enables robust, memory-efficient deployment of large language models on consumer hardware, with a favorable trade-off between accuracy, complexity, and deployment latency.
Abstract
Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to suppress outliers before quantization, exploiting computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices, which achieve the optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries, which are non-differentiable and thus prohibit gradient-based learning, the continuous parameterization of butterfly transforms enables smooth optimization while guaranteeing orthogonality by construction. This orthogonality constraint preserves theoretical guarantees on outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU.
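To make the construction concrete, the following is a minimal NumPy sketch of a butterfly transform built from Givens rotations. It is an illustration under stated assumptions, not the authors' implementation: the stage layout (pairing index $i$ with $i + 2^s$ at stage $s$), the function names `butterfly_apply` and `butterfly_matrix`, and the random angles are all hypothetical choices, and no learning or quantization is performed. It only checks the three properties the abstract relies on: $\frac{n \log n}{2}$ parameters, orthogonality by construction, and computational invariance $\mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$.

```python
import numpy as np

def butterfly_apply(x, angles):
    """Apply a butterfly transform along the last axis of x.

    x:      array of shape (..., n), with n a power of 2.
    angles: array of shape (log2(n), n // 2), one Givens angle per
            coordinate pair per stage (assumed layout, for illustration).
    """
    n = x.shape[-1]
    num_stages = n.bit_length() - 1
    assert 2 ** num_stages == n, "this sketch assumes n is a power of 2"
    assert angles.shape == (num_stages, n // 2)

    y = np.array(x, dtype=np.float64)
    for s in range(num_stages):
        stride = 2 ** s
        # Within each block of size 2 * stride, pair index i with i + stride.
        idx = np.arange(n).reshape(-1, 2, stride)
        lo = idx[:, 0, :].ravel()
        hi = idx[:, 1, :].ravel()
        c, sn = np.cos(angles[s]), np.sin(angles[s])
        a, b = y[..., lo], y[..., hi]
        out = np.empty_like(y)
        # A 2x2 Givens rotation on each pair; each stage is orthogonal.
        out[..., lo] = c * a - sn * b
        out[..., hi] = sn * a + c * b
        y = out
    return y

def butterfly_matrix(angles):
    """Materialize the dense n x n matrix Q with Q @ x == butterfly_apply(x)."""
    n = 2 * angles.shape[1]
    # Row i of butterfly_apply(I) is Q @ e_i, so transpose to get Q.
    return butterfly_apply(np.eye(n), angles).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 8, 5
    angles = rng.uniform(-np.pi, np.pi, size=(3, n // 2))  # n log2(n) / 2 = 12 angles
    Q = butterfly_matrix(angles)

    W = rng.standard_normal((m, n))
    x = rng.standard_normal(n)

    # Orthogonality holds for any choice of angles.
    assert np.allclose(Q @ Q.T, np.eye(n))
    # Computational invariance: rotating weights and activations leaves y unchanged.
    assert np.allclose((W @ Q.T) @ butterfly_apply(x, angles), W @ x)
```

Because every stage is a product of disjoint 2x2 rotations, the composite stays orthogonal for any real-valued angles, which is what allows unconstrained gradient updates on the angles. In practice the dense matrix would not be materialized; `butterfly_apply` already costs $O(n \log n)$ per vector.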
