
Boosting Entropy with Bell Box Quantization

Ningfeng Yang, Tor M. Aamodt

TL;DR

BBQ is the first information-theoretically optimal (ITO) quantization method that is also compute-efficient. It performs ITO quantization in its input domain and returns its output in a compute-efficient domain, where ITO data types are mapped to compute-efficient data types.

Abstract

Quantization-Aware Pre-Training (QAPT) is an effective technique to reduce the compute and memory overhead of Deep Neural Networks while improving their energy efficiency on edge devices. Existing QAPT methods produce models stored in compute-efficient data types (e.g., integers) that are not information-theoretically optimal (ITO). On the other hand, existing ITO data types (e.g., Quantile/NormalFloat Quantization) are not compute-efficient. We propose BBQ, the first ITO quantization method that is also compute-efficient. BBQ builds on our key insight that, since learning is domain-agnostic, the output of a quantizer does not need to reside in the same domain as its input. BBQ performs ITO quantization in its input domain and returns its output in a compute-efficient domain, where ITO data types are mapped to compute-efficient data types. Without sacrificing compute efficiency, BBQ outperforms prior SOTA QAPT methods, reducing perplexity by up to 2 points for 4-bit models, up to 4 points for 3-bit models, up to 5 points for 2-bit models, and up to 18 points for 1-bit models. Code is available at https://github.com/1733116199/bbq.

Paper Structure (26 sections, 11 equations, 22 figures, 10 tables, 1 algorithm)


Figures (22)

  • Figure 1: The three steps of the BBQ quantization formula (Equation \ref{eqn:bbq_quant}) for $b=4$. Step (a) is the Hadamard Transform (HT) followed by RMS normalization. Step (b) is the probability integral transform (PIT). Step (c) is uniform quantization. We name our method Bell Box Quantization because Figure 1(a) looks like a bell and Figure 1(b) looks like a box (rectangle), which is quantized in Figure 1(c).
  • Figure 2: Comparison of clip (blue) and the standard Gaussian CDF $\Phi$ (orange).
  • Figure 3: Quantized weight entropy vs. training iterations for LLaMA-300M with 2-bit weights and activations, pre-trained on 20 billion C4 tokens (batched into 80 thousand training iterations).
  • Figure 4: Kernel latency on Nvidia RTX 5090 (y-axis) vs. matrix size $N$ (x-axis).
  • Figure 5: End-to-end LLaMA inference latency of BBQ, FP16 and NF4 on RTX 5090 and A100 GPUs. For each linear layer, BBQ launches an activation quantization kernel (green region), an fp4/int4 matrix multiplication kernel (part of blue regions), and an element-wise scaling kernel (part of orange region).
  • ...and 17 more figures
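
The three steps named in the Figure 1 caption (Hadamard Transform with RMS normalization, probability integral transform, uniform quantization) can be illustrated with a small standard-library Python sketch. This is not the authors' implementation: the exact quantization formula is not reproduced in this excerpt, so the function names (`hadamard_transform`, `rms_normalize`, `gaussian_cdf`, `uniform_quantize`, `bell_box_sketch`, `entropy_bits`), the use of the standard Gaussian CDF $\Phi$ as the PIT, and the floor-based binning are all assumptions made for illustration. The mapping of the resulting codes to a compute-efficient output domain is not shown.

```python
import math
import random
from collections import Counter


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def rms_normalize(x, eps=1e-12):
    """Divide by the root-mean-square so the values have roughly unit scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]


def gaussian_cdf(z):
    """Standard Gaussian CDF, used here as the probability integral transform."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def uniform_quantize(u, bits):
    """Map u in [0, 1] to an integer code in [0, 2**bits - 1] (assumed binning)."""
    levels = 1 << bits
    return min(int(u * levels), levels - 1)


def bell_box_sketch(x, bits=4):
    """Step (a): HT + RMS norm ("bell"); (b): PIT ("box"); (c): uniform quantization."""
    z = rms_normalize(hadamard_transform(x))
    return [uniform_quantize(gaussian_cdf(v), bits) for v in z]


def entropy_bits(codes):
    """Empirical Shannon entropy of the integer codes, in bits."""
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(codes).values())


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical input: Gaussian values with an arbitrary scale.
    x = [random.gauss(0.0, 3.0) for _ in range(4096)]
    codes = bell_box_sketch(x, bits=4)
    print("entropy (bits):", entropy_bits(codes))
```

For roughly Gaussian inputs, the PIT output is approximately uniform on [0, 1], so the uniformly quantized codes use all $2^b$ levels about equally often and their empirical entropy approaches $b$ bits. That is the sense in which the quantizer in this sketch is entropy-maximizing.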