Compression Scaling Laws: Unifying Sparsity and Quantization
Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
TL;DR
The paper examines how compression techniques interact with LLM scaling during pretraining and proposes a unified framework based on an effective parameter count. It introduces the compressed scaling law $L(N, D, C) = \frac{a}{(N \cdot \text{eff}(C))^b} + \frac{c}{D^d} + e$, where $N$ is the model size, $D$ the amount of training data, $C$ the compression type, and $\text{eff}(C)$ the effective parameter multiplier (EPM). The study validates this law across weight-only quantization, full weight-and-activation quantization, and sparsity. Weight-only quantization is near-lossless at 4 bits ($\text{EPM} \approx 0.923$), whereas full quantization shows diminishing returns at lower bitwidths ($\text{EPM}$ drops to $0.747$ at 4 bits and $0.067$ at 1 bit). Sparsity generally yields lower multipliers but is competitive in specific regimes (e.g., 50% sparsity gives $\text{EPM} \approx 0.871$, close to 8-bit quantization). Overall, the framework enables principled comparisons and guides compute budgeting for compressed LLM training and deployment.
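To make the law concrete, the sketch below evaluates it in Python. The EPM values are the ones quoted above. The function names and the coefficients `a`, `b`, `c`, `d`, `e` are illustrative placeholders assumed here, not fitted values from the paper. The point it illustrates is that a compressed model with $N$ parameters is predicted to behave like a dense model with $N \cdot \text{eff}(C)$ parameters.

```python
# EPM values as quoted in the TL;DR above.
EPM = {
    "weight-only 4-bit": 0.923,
    "full 4-bit": 0.747,
    "full 1-bit": 0.067,
    "50% sparsity": 0.871,
}

# Placeholder coefficients (assumption): chosen only to make the example
# runnable; they are NOT the fitted values from the paper.
PLACEHOLDER_COEFFS = dict(a=400.0, b=0.34, c=410.0, d=0.28, e=1.69)


def effective_params(n_params: float, eff: float) -> float:
    """Dense-equivalent parameter count N * eff(C)."""
    return n_params * eff


def compressed_loss(n_params: float, n_tokens: float, eff: float,
                    a: float, b: float, c: float, d: float, e: float) -> float:
    """L(N, D, C) = a / (N * eff(C))^b + c / D^d + e."""
    return a / effective_params(n_params, eff) ** b + c / n_tokens ** d + e


if __name__ == "__main__":
    n, tokens = 1e9, 2e10  # hypothetical model size and token budget
    for name, eff in EPM.items():
        n_eff = effective_params(n, eff)
        loss = compressed_loss(n, tokens, eff, **PLACEHOLDER_COEFFS)
        print(f"{name:>18}: N_eff = {n_eff:.3e}, predicted loss = {loss:.4f}")
```

Because compression enters only through the product $N \cdot \text{eff}(C)$, two configurations with the same effective parameter count and the same data budget receive the same predicted loss, which is what allows sparsity and quantization to be compared on a single axis.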
Abstract
We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.
