Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello

TL;DR

This work tackles the challenge of deploying large language models on edge devices by addressing activation outliers that hinder low-bit quantization. It introduces Gradual Binary Search (GBS) guided by perplexity and employs Hadamard rotation matrices to achieve $3$-bit WAKV quantization, complemented by Paley-based dimension expansion to support non-$2^k$ embedding sizes. The authors provide theoretical results showing Hadamard matrices minimize outliers and prove near-optimality among orthogonal rotations, then demonstrate empirical gains across Mistral, LLaMA, and Qwen models. The approach delivers meaningful accuracy and perplexity improvements while broadening architectural compatibility, enabling practical, efficient quantization for diverse LLMs on resource-constrained hardware.

Abstract

Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time; however, LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle to achieving low-bit quantization. Our method, based on a gradual binary search, enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, such as those of the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrate the superiority of Hadamard matrices in reducing outliers. We achieve 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple model families such as Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.
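The abstract names the Paley algorithm as the route to Hadamard matrices of non-power-of-2 order but gives no construction details here. As a minimal sketch, assuming the standard Paley construction I restricted to a prime $q \equiv 3 \pmod 4$ (prime powers are not handled), the following builds a Hadamard matrix of order $q + 1$. The function names `legendre` and `paley_hadamard` are illustrative and do not come from the paper.

```python
import numpy as np

def legendre(a, q):
    """Quadratic character of a modulo an odd prime q: 0, +1 or -1."""
    a %= q
    if a == 0:
        return 0
    # Euler's criterion: a is a nonzero square mod q iff a^((q-1)/2) == 1.
    return 1 if pow(a, (q - 1) // 2, q) == 1 else -1

def paley_hadamard(q):
    """Hadamard matrix of order q + 1 for a prime q with q % 4 == 3."""
    is_prime = q >= 2 and all(q % p for p in range(2, int(q ** 0.5) + 1))
    if not is_prime or q % 4 != 3:
        raise ValueError("this sketch needs a prime q with q % 4 == 3")
    chi = np.array([legendre(a, q) for a in range(q)])
    idx = np.arange(q)
    # Jacobsthal matrix: entry (i, j) is chi(i - j); skew-symmetric here.
    jacobsthal = chi[(idx[:, None] - idx[None, :]) % q]
    # Border it into a skew-symmetric matrix S with S @ S.T == q * I.
    S = np.zeros((q + 1, q + 1), dtype=int)
    S[0, 1:] = 1
    S[1:, 0] = -1
    S[1:, 1:] = jacobsthal
    # I + S has only +1/-1 entries and mutually orthogonal rows.
    return np.eye(q + 1, dtype=int) + S

H = paley_hadamard(11)  # order 12, which is not a power of two
assert np.array_equal(H @ H.T, 12 * np.eye(12, dtype=int))
```

Dividing `H` by $\sqrt{q + 1}$ gives an orthogonal matrix that can play the same role as a normalized power-of-2 Hadamard matrix.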

Paper Structure

This paper contains 32 sections, 5 theorems, 17 equations, 6 figures, 6 tables, 2 algorithms.

Key Result

Theorem 3.1

$\forall x \in \mathbb{R}^n$ containing an outlier, i.e., $x = (c, \epsilon, \dots, \epsilon)^T$ with $c \gg \epsilon$, we have, with $H$ a Hadamard matrix in $\mathbb{R}^{n \times n}$ and $Q$ a rotation matrix drawn randomly on the unit sphere $\mathcal{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$.
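The conclusion of the theorem is not reproduced in this excerpt, so the sketch below does not restate it. It only illustrates numerically the qualitative claim from the abstract: on a vector $x = (c, \epsilon, \dots, \epsilon)^T$, a normalized Hadamard matrix leaves a smaller maximum absolute entry than a randomly drawn orthogonal matrix. The Sylvester construction, the QR-based sampling of the random matrix, and the values $c = 200$, $\epsilon = 0.01$ are assumptions chosen for illustration.

```python
import numpy as np

def sylvester_hadamard(n):
    """Normalized Hadamard matrix of order n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # rows now have unit norm, so H is orthogonal

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    # Fixing the signs of the columns makes the draw uniform (Haar).
    return Q * np.sign(np.diag(R))

def outlier_vector(n, c=200.0, eps=0.01):
    """Vector with one large entry c followed by n - 1 small entries eps."""
    x = np.full(n, eps)
    x[0] = c
    return x

def max_abs_after_rotation(n, seed=0):
    """Return (max |Hx|, max |Qx|) for the outlier vector of dimension n."""
    rng = np.random.default_rng(seed)
    x = outlier_vector(n)
    h_max = np.abs(sylvester_hadamard(n) @ x).max()
    q_max = np.abs(random_orthogonal(n, rng) @ x).max()
    return h_max, q_max

h_max, q_max = max_abs_after_rotation(512)
print(f"Hadamard: {h_max:.3f}  random: {q_max:.3f}")
```

For this particular $x$, the Hadamard result can be checked by hand: the first row of the Sylvester matrix is all ones, so the largest entry of $Hx$ is $(c + (n - 1)\epsilon)/\sqrt{n}$.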

Figures (6)

  • Figure 1: Architecture pipeline with rotation matrices $R_1, \, R_2,\, R_3,\, R_4$ and dimension expansion. Red lines represent expanded tokens in $4096 + d$ dimensions and green lines represent non-expanded tokens. Projections in red (QKV, Gate, Up and $LM_{head}$) have their input weight dimension expanded and projections in green (Out, Down and Embeddings) have their output weight dimension expanded.
  • Figure 2: Effect of expanding dimensions on the average (AVG) of 6 benchmarks for different models in 3-bit WAKV quantization, and the computational limit of Lemma 4.1. Due to memory constraints on an A100 GPU, we could not expand LLaMA3-8B by more than 2036 dimensions.
  • Figure 3: Maximum absolute value as a function of dimension for a randomly drawn rotation matrix and a Hadamard matrix applied to a vector containing a peak at 200, obtained experimentally (blue) and theoretically (red).
  • Figure 4: PPL vs. layer during Gradual Binary Search on 10% of the WikiText2 training set for a LLaMA3-8B in 4-bit quantization, rotated with QuaRot. For better visualization, PPL is capped at 9. Point opacity represents the clipping ratio: the more transparent the point, the closer the value is to 0.
  • Figure 5: Final configurations obtained with GBS started in 4 bits and in FP16 for a LLaMA3-8B
  • ...and 1 more figure
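The Figure 1 caption describes expanding the output dimension of projections that write tokens and the input dimension of projections that read them. A minimal sketch of why this can leave the model's output unchanged is given below. It assumes that expansion means zero-padding, that weights have shape `(out, in)`, and it ignores normalization layers; none of these details are confirmed by the text above, and the function names are hypothetical.

```python
import numpy as np

def expand_output_dim(W, d):
    """Append d zero rows: (out, in) -> (out + d, in). The layer now writes expanded tokens."""
    return np.pad(W, ((0, d), (0, 0)))

def expand_input_dim(W, d):
    """Append d zero columns: (out, in) -> (out, in + d). The layer now reads expanded tokens."""
    return np.pad(W, ((0, 0), (0, d)))

def expansion_error(n=6, d=2, hidden=10, seed=0):
    """Max deviation between the original and the expanded, rotated computation."""
    rng = np.random.default_rng(seed)
    W_write = rng.standard_normal((n, hidden))  # stands in for a "green" projection
    W_read = rng.standard_normal((hidden, n))   # stands in for a "red" projection
    h = rng.standard_normal(hidden)
    reference = W_read @ (W_write @ h)

    # Any orthogonal matrix of size n + d works here; a random one is used
    # so that the sketch does not depend on a Hadamard matrix of that order.
    R, _ = np.linalg.qr(rng.standard_normal((n + d, n + d)))
    W_write_rot = R @ expand_output_dim(W_write, d)   # emits rotated, expanded tokens
    W_read_rot = expand_input_dim(W_read, d) @ R.T    # undoes the rotation on input
    rotated = W_read_rot @ (W_write_rot @ h)
    return np.abs(reference - rotated).max()

print(expansion_error())  # close to zero, up to floating-point error
```

The padded coordinates carry zeros before rotation, so they add nothing to the product, while the rotation acts on the larger space of dimension `n + d`.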

Theorems & Definitions (9)

  • Theorem 3.1: Hadamard reduction
  • Lemma 3.1: Hadamard incoherence
  • Lemma 3.2: Rotation incoherence
  • Theorem 3.2: Hadamard optimality
  • Lemma 4.1: Expanding limit
  • proof : Proof of Lemma 3.1
  • proof : Proof of Lemma 3.2
  • proof : Proof of Theorem 3.2
  • proof : Proof of Lemma 4.1