ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

TL;DR

ButterflyQuant addresses the memory bottleneck of deploying LLMs at extreme quantization by introducing learnable butterfly transforms that maintain orthogonality and adapt to layer-specific outlier patterns. Unlike fixed Hadamard rotations, the continuous Givens-rotation parameterization enables gradient-based learning with $O(n \log n)$ parameters, and non-power-of-2 dimensions are handled via Kronecker-based composites. The method achieves state-of-the-art 2-bit performance among rotation-based PTQ approaches, with substantial improvements in perplexity and task accuracy on LLaMA-2 models, while requiring only 128 calibration samples and minutes of training on a single GPU. This practical, layer-adaptive approach enables robust, memory-efficient deployment of large language models on consumer hardware, with a favorable trade-off between accuracy, complexity, and deployment latency.

Abstract

Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU.
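
The two structural claims in the abstract, orthogonality by construction and computational invariance, can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `butterfly_matrix`, the pairing scheme (index `i` paired with `i + 2^s` at stage `s`), and the dense matrix construction are assumptions made for illustration, and the sketch only covers $n = 2^k$.

```python
import numpy as np

def butterfly_matrix(angles):
    """Build an n x n orthogonal matrix from log2(n) stages of Givens rotations.

    angles has shape (log2(n), n // 2): one angle per index pair per stage,
    i.e. (n log2 n) / 2 parameters in total.
    """
    stages, half = angles.shape
    n = 2 * half
    assert n == 2 ** stages, "this sketch only handles n = 2^k"
    Q = np.eye(n)
    for s in range(stages):
        stride = 2 ** s
        B = np.zeros((n, n))  # one butterfly stage: n/2 disjoint 2x2 rotations
        pair = 0
        for start in range(0, n, 2 * stride):
            for offset in range(stride):
                i = start + offset
                j = i + stride
                c, t = np.cos(angles[s, pair]), np.sin(angles[s, pair])
                B[i, i], B[i, j] = c, -t
                B[j, i], B[j, j] = t, c
                pair += 1
        Q = B @ Q  # a product of orthogonal stages is orthogonal
    return Q

rng = np.random.default_rng(0)
n = 8
angles = rng.uniform(-np.pi, np.pi, size=(3, n // 2))  # hypothetical angles
Q = butterfly_matrix(angles)

W = rng.normal(size=(5, n))  # stand-in weight matrix
x = rng.normal(size=n)       # stand-in activation vector

# Orthogonality holds for any choice of angles.
orth_err = np.abs(Q @ Q.T - np.eye(n)).max()
# Computational invariance: W x == (W Q^T)(Q x).
inv_err = np.abs(W @ x - (W @ Q.T) @ (Q @ x)).max()
print(orth_err, inv_err)
```

Because every stage is a set of disjoint plane rotations, no projection or re-orthogonalization step is needed during gradient updates of the angles. The dense product above costs $O(n^2)$ memory and is only for checking; an $O(n \log n)$ implementation would apply each stage directly to the vector instead of materializing $\mathbf{Q}$.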

Paper Structure

This paper contains 43 sections, 3 theorems, 28 equations, 5 figures, 5 tables.

Key Result

Theorem 3.1

The Hadamard matrix $\mathbf{H}_n$ for $n = 2^k$ can be exactly represented as a butterfly transform with specific parameter choices (Dao et al., 2019).
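
A small numerical check makes the statement concrete. This is a sketch under assumptions, not the paper's proof: it uses the Sylvester construction for $\mathbf{H}_n$, and the helper names are hypothetical. It also shows that the $2 \times 2$ Hadamard block equals a Givens rotation by $\pi/4$ composed with a sign flip, since the block is a reflection (determinant $-1$) while a pure rotation has determinant $+1$. How the paper absorbs that sign is not stated in this excerpt.

```python
import numpy as np

# Normalized 2x2 Hadamard block used at every butterfly stage.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

def sylvester_hadamard(n):
    """Normalized Hadamard matrix via H_{2m} = (1/sqrt(2)) [[H_m, H_m], [H_m, -H_m]]."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H) / np.sqrt(2)
    return H

def hadamard_from_butterfly(n):
    """Product of log2(n) butterfly stages, each applying H2 to pairs (i, i + stride)."""
    k = n.bit_length() - 1
    assert 2 ** k == n, "this sketch only handles n = 2^k"
    H = np.eye(n)
    for s in range(k):
        stride = 2 ** s
        B = np.zeros((n, n))
        for start in range(0, n, 2 * stride):
            for offset in range(stride):
                i = start + offset
                j = i + stride
                B[np.ix_([i, j], [i, j])] = H2
        H = B @ H
    return H

# The 2x2 block as a Givens rotation by pi/4 followed by a sign flip.
theta = np.pi / 4
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
block_matches = np.allclose(G @ np.diag([1.0, -1.0]), H2)

stages_match = np.allclose(hadamard_from_butterfly(16), sylvester_hadamard(16))
print(block_matches, stages_match)
```

Each stage acts as $\mathbf{I} \otimes \mathbf{H}_2 \otimes \mathbf{I}$ on one bit of the index, so the product over all $k$ stages is the $k$-fold Kronecker power of $\mathbf{H}_2$, which is the Sylvester Hadamard matrix.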

Figures (5)

  • Figure 1: Layer heterogeneity motivates learnable transforms for LLM quantization. (a) Different transformer layers exhibit distinct outlier distributions: attention (positive tails), early MLP (negative regions), late MLP (boundaries). (b) Hadamard transforms with discrete $\{+1,-1\}$ entries apply fixed rotations through recursive decomposition $\mathbf{H}_{2n} = \frac{1}{\sqrt{2}}\begin{pmatrix} \mathbf{H}_n & \mathbf{H}_n \\ \mathbf{H}_n & -\mathbf{H}_n \end{pmatrix}$, achieving uniform coherence $\mu = 1/\sqrt{n} = 0.0156$ across all layers. (c) Butterfly transforms use continuous rotation angles $\theta_{i,j}$ in Givens rotations $\mathbf{G}(\theta)$, enabling gradient-based optimization to learn layer-specific patterns. This yields adaptive coherence that matches each layer's outlier distribution.
  • Figure 2: Mutual coherence $\mu(\mathbf{Q})$ across transformer layers for different rotation strategies on LLaMA-2-7B. Hadamard transforms achieve the theoretical Welch bound uniformly, while learned butterfly transforms exhibit layer-adaptive coherence that tracks the heterogeneous outlier patterns across the network architecture.
  • Figure 3: Impact of initialization strategy on final perplexity.
  • Figure 4: Convergence analysis showing 86% improvement within 200 steps.
  • Figure 5: Training dynamics of ButterflyQuant demonstrating the impact of key design choices.
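
The coherence value quoted in the Figure 1 caption can be reproduced with a few lines. This sketch assumes that $\mu(\mathbf{Q})$ means the largest absolute entry of the orthogonal matrix, which is consistent with the value $1/\sqrt{n}$ for a normalized Hadamard matrix; the excerpt does not state the definition, and the function names are hypothetical. The caption's $0.0156$ corresponds to $n = 4096$, since $1/\sqrt{4096} = 1/64$.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix for n = 2^k."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H) / np.sqrt(2)
    return H

def coherence(Q):
    # Assumed definition: largest absolute entry of the orthogonal matrix.
    return float(np.abs(Q).max())

n = 64  # small size for a quick check; the caption's value uses n = 4096
mu_hadamard = coherence(hadamard(n))  # every entry has magnitude 1/sqrt(n)
mu_identity = coherence(np.eye(n))    # worst case: all mass on one coordinate
mu_4096 = 1 / np.sqrt(4096)           # value quoted in the Figure 1 caption
print(mu_hadamard, mu_identity, mu_4096)
```

Under this definition, $1/\sqrt{n}$ is the smallest value any $n \times n$ orthogonal matrix can reach, because each row has unit norm. A learned transform can therefore only match or exceed the Hadamard value, so the benefit of layer-adaptive rotations has to come from fitting each layer's actual distribution rather than from a lower worst-case bound.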

Theorems & Definitions (5)

  • Theorem 3.1
  • proof : Proof Sketch
  • Theorem 3.2: Expressive Power of Butterfly Transforms
  • Theorem 7.1: Expressive Power of Butterfly Transforms
  • proof : Proof Sketch