Compression Scaling Laws: Unifying Sparsity and Quantization

Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh

TL;DR

The paper studies how compression techniques interact with LLM scaling during pretraining and proposes a unified framework based on an effective parameter count. It introduces the compressed scaling law $L(N, D, C) = \frac{a}{(N \cdot \text{eff}(C))^b} + \frac{c}{D^d} + e$, where $N$ is the model size, $D$ the amount of training data, $C$ the compression type, $\text{eff}(C)$ the effective parameter multiplier (EPM) of that compression type, and $a, b, c, d, e$ are fitted constants. The study validates this law across weight-only quantization, full weight-and-activation quantization, and sparsity. Weight-only quantization is near-lossless at 4 bits ($\text{EPM} \approx 0.923$), while full quantization shows diminishing returns at low bitwidths (e.g., $\text{EPM}$ drops to $0.747$ at 4 bits and $0.067$ at 1 bit). Sparsity generally yields lower multipliers but is competitive in specific regimes (e.g., 50% sparsity gives $\text{EPM} \approx 0.871$, close to that of 8-bit quantization). Overall, the framework enables principled comparisons between compression methods and guides compute budgeting for compressed LLM training and deployment.
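To make the role of the multiplier concrete, the following is a minimal sketch of the compressed scaling law in Python. The function names and the coefficient values in `COEFFS` are hypothetical placeholders chosen for illustration, not the paper's fitted constants; only the EPM values are taken from the summary above.

```python
import math


def compressed_loss(n_params, n_tokens, eff, a, b, c, d, e):
    """Evaluate L(N, D, C) = a / (N * eff(C))**b + c / D**d + e."""
    return a / (n_params * eff) ** b + c / n_tokens ** d + e


def dense_equivalent_params(n_params, eff):
    """Size of the uncompressed model that the law predicts reaches the same loss."""
    return n_params * eff


# Hypothetical coefficients for illustration only (NOT the paper's fit).
COEFFS = dict(a=400.0, b=0.34, c=410.0, d=0.28, e=1.69)

# EPM values quoted in the TL;DR.
EPM = {
    "weight-only 4-bit": 0.923,
    "full 4-bit": 0.747,
    "full 1-bit": 0.067,
    "50% sparsity": 0.871,
}

if __name__ == "__main__":
    n_params, n_tokens = 1e9, 2e10  # assumed model size and token count
    for name, eff in EPM.items():
        loss = compressed_loss(n_params, n_tokens, eff, **COEFFS)
        n_dense = dense_equivalent_params(n_params, eff)
        print(f"{name}: loss={loss:.4f}, dense-equivalent params={n_dense:.3e}")
```

Under this form, a compressed model with $N$ parameters is predicted to match an uncompressed model with $N \cdot \text{eff}(C)$ parameters trained on the same data, which is what allows different compression types to be compared on a single axis.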

Abstract

We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Learning rate sweep comparison for BF16 and INT3 models.
  • Figure 2: Validation of the independence between scaling laws and the amount of data used. The legend represents the number of bits per weight.
  • Figure 3: Scaling results (loss and fit) for weight-only quantization.
  • Figure 4: Scaling results for full quantization with linear and quadratic speedup counting.
  • Figure 5: Scaling law fit for full quantization.
  • ...and 2 more figures