ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra

TL;DR

ParetoQ addresses conflicting scaling-law conclusions for ultra-low-bit LLM quantization by unifying training and quantization across 1, 1.58, 2, 3, and 4 bits. It introduces learnable-range quantization functions (SEQ, LSQ) and a single framework that compares bit-widths on equal footing, revealing a learning transition between the 2-bit and 3-bit regimes. The results show that 1.58-, 2-, and 3-bit quantization can surpass 4-bit in accuracy at similar model sizes, with 2-bit being the most hardware-friendly of these options. Empirically, ParetoQ outperforms previous methods tailored to specific bit widths, including a 600M-parameter ternary model that is more accurate than the previous SoTA 3B-parameter ternary model, and hardware considerations point to memory reduction and on-device speedup for 2-bit quantization.

Abstract

The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: for 3 bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintain comparable performance in the size-accuracy trade-off and generally exceed 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.

Paper Structure

This paper contains 27 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Pareto curves of accuracy-size trade-offs.
  • Figure 2: With a fixed total training budget of 100B tokens ($\mathcal{B}_\text{train}$), where $\mathcal{B}_\text{FP} + \mathcal{B}_\text{QAT} = \mathcal{B}_\text{train}$, we explore optimal allocation between full-precision pretraining ($\mathcal{B}_\text{FP}$) and QAT fine-tuning ($\mathcal{B}_\text{QAT}$). "0.0" represents QAT from scratch, while "1.0" indicates full-precision pretraining followed by PTQ. Results on MobileLLM-125M show peak accuracy with $\sim$90% of the budget for full-precision pretraining and $\sim$10% for QAT fine-tuning.
  • Figure 3: Analysis of training token requirements for quantization-aware fine-tuning and training from scratch across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit settings. Fine-tuning typically saturates at 10B tokens for 3-bit and 4-bit, and at 30B tokens for 1-bit, 1.58-bit, and 2-bit. Fine-tuning consistently outperforms training from scratch in both accuracy and token efficiency across all bit configurations.
  • Figure 4: L1 norm difference between QAT-finetuned weights and full-precision initialization ($\|W_{\text{finetune}} - W_{\text{init}}\|_{1} / \|W_{\text{init}}\|_{1}$). Models quantized to 1, 1.58, and 2 bits show larger weight changes, indicating distinct 'compensation' behavior in higher-bit quantization and 'reconstruction' in lower-bit settings.
  • Figure 5: Impact of quantization grid choice across bit widths. Binary quantization uses a sign function; ternary and 2-bit quantization prefer more balanced output levels and range coverage; for 3-bit and higher, including "0" in the quantization levels is more favorable.
  • ...and 9 more figures
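
The weight-change metric in the Figure 4 caption, the L1 norm of the difference between fine-tuned and initial weights divided by the L1 norm of the initial weights, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `relative_l1_change` is hypothetical, the norm is assumed to be the entrywise L1 norm over a flattened weight tensor, and the toy weights are invented purely to show the arithmetic.

```python
def relative_l1_change(w_init, w_finetune):
    """Return ||W_finetune - W_init||_1 / ||W_init||_1 for flat weight lists.

    Assumption: the L1 norm is taken entrywise (sum of absolute values).
    """
    if len(w_init) != len(w_finetune):
        raise ValueError("weight lists must have the same length")
    init_norm = sum(abs(w) for w in w_init)
    if init_norm == 0.0:
        raise ValueError("initial weights have zero L1 norm")
    diff_norm = sum(abs(f - i) for i, f in zip(w_init, w_finetune))
    return diff_norm / init_norm


# Toy values, invented for illustration only (not results from the paper).
w_init = [0.5, -0.25, 0.25, -1.0]
small_shift = [0.5, -0.25, 0.5, -1.0]   # weights stay near initialization
large_shift = [1.0, -1.0, -0.25, 1.0]   # weights move far from initialization

print(relative_l1_change(w_init, small_shift))  # 0.125
print(relative_l1_change(w_init, large_shift))  # 1.875
```

A small ratio corresponds to the 'compensation' regime the caption associates with higher bit widths, and a large ratio to the 'reconstruction' regime it associates with 2 bits and below.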