Price of universality in vector quantization is at most 0.11 bit
Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
TL;DR
This work addresses universal vector quantization of weight matrices used in inner-product computations. It proves the existence of a single codebook that nearly matches the waterfilling rate-distortion benchmark for every input covariance $\Sigma_X$. Using a random-coding construction with an isotropic Gaussian codebook and an encoder-selected scale $\tau$, the authors show that the per-coordinate distortion is at most $\mathbf{D}_{\mathrm{rc}}(\mathrm{spec}(\Sigma_X),R^{\star})$ plus a small term, uniformly over all $\Sigma_X$ with $\mathrm{tr}(\Sigma_X)=n$, at rate $R \le R^{\star}+\varepsilon$. They further prove that the worst-case rate overhead relative to oracle waterfilling is at most $0.11$ bits per coordinate for distortions away from extreme values, so universality costs only a small rate penalty that does not depend on $\Sigma_X$. The results imply that a universal, near-optimal low-precision storage format for $W$ exists across diverse input statistics, though the existence proof is non-constructive and explicit codebooks remain an open challenge. Overall, the paper connects universal quantization to Gaussian rate-distortion theory and to dense nets over covariance families, giving a foundational account of universality in high-dimensional quantization under Hilbert norms.
Abstract
Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ ("weight-only quantization"). Information theory shows that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires a careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of a sphere simultaneously with respect to all Hilbert norms.
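To make the "waterfilling allocation" benchmark concrete, the following is a minimal sketch of textbook reverse waterfilling for a Gaussian source, applied to the eigenvalues of $\Sigma_X$. It assumes (the abstract does not say so) that $W$ has i.i.d. unit-variance Gaussian entries and that distortion is measured as $\mathbb{E}[((W-\widehat W)^\top X)^2]$ per coordinate; the paper's exact benchmark $\mathbf{D}_{\mathrm{rc}}$ may be defined differently. The function name `reverse_waterfill` and the example spectrum are hypothetical.

```python
import math

def reverse_waterfill(eigs, rate_bits):
    """Textbook reverse waterfilling (illustrative sketch, not the paper's code).

    eigs: positive eigenvalues of Sigma_X (assumed PCA spectrum).
    rate_bits: average rate budget in bits per coordinate.
    Returns (water_level, per_coordinate_distortion).
    """
    n = len(eigs)

    def avg_rate(theta):
        # Only directions with eigenvalue above the water level receive bits.
        return sum(0.5 * math.log2(lam / theta) for lam in eigs if lam > theta) / n

    # avg_rate is decreasing in theta, so bisect for the water level.
    lo, hi = 0.0, max(eigs)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if avg_rate(mid) > rate_bits:
            lo = mid   # rate too high: raise the water level
        else:
            hi = mid   # rate within budget: try a lower water level
    theta = hi

    # Each direction contributes min(eigenvalue, water level) to the distortion.
    distortion = sum(min(lam, theta) for lam in eigs) / n
    return theta, distortion

# Example spectrum normalized so that tr(Sigma_X) = n, as in the TL;DR.
theta, d = reverse_waterfill([2.0, 1.0, 0.5, 0.5], rate_bits=1.0)
print(f"water level = {theta:.4f}, distortion = {d:.4f}")
```

The allocation depends on the spectrum of $\Sigma_X$ and, when realized as a codebook, on its eigenvectors as well. That dependence is exactly what a universal codebook has to avoid.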
