To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration

Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava

TL;DR

This work investigates how exponent concentration in GenAI weights enables lossless compression within low-precision FP formats. By showing that exponents follow $\alpha$-stable distributions under stochastic gradient dynamics, the authors derive finite entropy bounds and a theoretical compression limit near FP4.67, motivating a practical FP8 approach. They introduce Exponent-Concentrated FP8 (ECF8), a lossless FP8 compression framework with Huffman-based exponent coding, GPU-optimized decoding, and just-in-time decompression, achieving up to 26.9% memory savings and up to 177.1% throughput gains on models up to 671B parameters while preserving bit-exact outputs. The results establish exponent concentration as a statistical law of trained models and provide a principled pathway for designing next-generation low-precision floating-point formats for GenAI.

Abstract

The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
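To make "low exponent entropy" concrete, the following is a minimal sketch of how one might measure the Shannon entropy of the exponent field of FP8 weights. It assumes a sign | 4-bit exponent | 3-bit mantissa byte layout (E4M3), which the text above does not specify, and it uses a handful of made-up toy bytes rather than real model weights. The helper names `exponent_field` and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter


def exponent_field(byte, exp_bits=4, man_bits=3):
    """Extract the exponent field from one FP8 byte.

    Assumed layout (not stated in the source): sign | exponent | mantissa.
    """
    return (byte >> man_bits) & ((1 << exp_bits) - 1)


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Toy bytes for illustration only; these are NOT taken from any model.
toy_weights = bytes([0x38, 0x3A, 0xB9, 0x31, 0x3F, 0xB8, 0x30, 0x41])

exponents = [exponent_field(b) for b in toy_weights]
entropy_bits = shannon_entropy(exponents)

# If the exponents cluster on a few values, the entropy falls well below
# the 4 bits that the fixed-width exponent field occupies.
print(exponents, round(entropy_bits, 3))
```

The gap between the fixed field width and the measured entropy is the headroom that an entropy coder such as Huffman coding can recover without changing any weight value.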

Paper Structure

This paper contains 35 sections, 2 theorems, 12 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $E = \lfloor \log_2 |X| \rfloor$ where $X \sim S_\alpha(\beta=0, \gamma, \delta)$. Then $E$ follows a discrete two-sided geometric distribution with parameter $q = 2^{-\alpha}$, and the Shannon entropy of $E$ is bounded. In particular, $H(E)$ is finite for all $\alpha > 0$.
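The distribution family named in Theorem 2.1 can be explored numerically. The sketch below assumes the textbook symmetric two-sided geometric form $P(E = k) = \frac{1-q}{1+q}\, q^{|k - k_0|}$ with a hypothetical center $k_0$; the paper's exact normalization and its entropy bounds are not reproduced here. For this assumed form the entropy has the closed expression $H(E) = \log_2\frac{1+q}{1-q} + \alpha \cdot \frac{2q}{1-q^2}$ with $q = 2^{-\alpha}$, which the code checks against a direct summation.

```python
import math


def two_sided_geometric_pmf(k, q, k0=0):
    """Assumed textbook form: P(E = k) = (1 - q) / (1 + q) * q**|k - k0|."""
    return (1 - q) / (1 + q) * q ** abs(k - k0)


def entropy_numeric(alpha, span=200):
    """Entropy in bits, summed directly over k in [-span, span]."""
    q = 2.0 ** (-alpha)
    h = 0.0
    for k in range(-span, span + 1):
        p = two_sided_geometric_pmf(k, q)
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def entropy_closed_form(alpha):
    """H = log2((1 + q) / (1 - q)) + alpha * 2q / (1 - q^2), with q = 2^-alpha."""
    q = 2.0 ** (-alpha)
    return math.log2((1 + q) / (1 - q)) + alpha * 2 * q / (1 - q * q)


for alpha in (0.5, 1.0, 1.5, 2.0):
    print(alpha, entropy_numeric(alpha), entropy_closed_form(alpha))
```

Under this assumed form the entropy is finite for every $\alpha > 0$ and shrinks as $\alpha$ grows, because a larger $\alpha$ gives a smaller $q$ and hence a sharper peak around $k_0$.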

Figures (4)

  • Figure 1: Entropy analysis across transformer blocks for different model architectures. The x-axis represents the block index within each model, and the y-axis shows the entropy values. Different colors indicate different block types within each architecture.
  • Figure 2: A simplified illustration of lookup table construction. A Huffman tree is built from the string "aaabbcddeeeee". Lookup tables are configured to be 2-bit, so each subtable has 4 entries. Codes for symbols "e" and "a" are at most 2 bits long and appear directly in the first table. In contrast, codes for "b", "c", and "d" exceed 2 bits and begin with "11", so entry "11" in the first table points to a secondary table. The pointer value is 1, indicating subtable 1. A second lookup in this subtable resolves the final symbol.
  • Figure 3: Images generated by ECF8-compressed Qwen-Image model, demonstrating pixel-perfect reconstruction quality compared to the original FP8 model. The images generated by the original FP8 model are shown in Figure 4.
  • Figure 4: Images generated by the FP8 Qwen-Image model. These are pixel-wise identical to the images generated by the ECF8-compressed model (see Figure 3).
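The two-level lookup described in the Figure 2 caption can be sketched in a few lines. This is a CPU-side illustration over bit strings, not the GPU decoder. The code table is an assumption chosen to match the caption ("e" and "a" within 2 bits; "b", "c", "d" beginning with "11"); the exact codes for "b", "c", and "d" depend on Huffman tie-breaking and may differ from the figure. The names `build_tables` and `decode` are hypothetical.

```python
K = 2  # bits consumed per table lookup, as in the caption

# Assumed prefix code for "aaabbcddeeeee" (one valid Huffman assignment).
CODES = {"e": "0", "a": "10", "b": "110", "c": "1110", "d": "1111"}


def build_tables(codes, k=K):
    """Build K-bit lookup tables; codes longer than k bits spill into subtables."""
    tables = [[None] * (1 << k)]
    prefix_to_table = {"": 0}
    for sym, code in codes.items():
        t, rest, prefix = 0, code, ""
        while len(rest) > k:
            prefix += rest[:k]
            if prefix not in prefix_to_table:
                tables.append([None] * (1 << k))
                prefix_to_table[prefix] = len(tables) - 1
                # Entry in the parent table points at the new subtable.
                tables[t][int(rest[:k], 2)] = ("ptr", prefix_to_table[prefix])
            t = prefix_to_table[prefix]
            rest = rest[k:]
        # A code shorter than k bits fills every entry that shares its prefix.
        pad = k - len(rest)
        base = int(rest, 2) << pad
        for fill in range(1 << pad):
            tables[t][base + fill] = ("sym", sym, len(rest))
    return tables


def decode(bits, tables, n_symbols, k=K):
    """Decode n_symbols from a bit string using the hierarchical tables."""
    padded = bits + "0" * k  # lets the last lookup peek a full k bits
    out, pos = [], 0
    while len(out) < n_symbols:
        t = 0
        while True:
            entry = tables[t][int(padded[pos:pos + k], 2)]
            if entry[0] == "ptr":
                pos += k          # consume the full prefix, go to the subtable
                t = entry[1]
            else:
                out.append(entry[1])
                pos += entry[2]   # consume only the bits this code used
                break
    return "".join(out)


text = "aaabbcddeeeee"
tables = build_tables(CODES)
bits = "".join(CODES[ch] for ch in text)
decoded = decode(bits, tables, len(text))
print(tables[0], decoded)
```

With this assumed code, entry "11" of the first table is a pointer to subtable 1, matching the caption, and every symbol is resolved in at most two lookups.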

Theorems & Definitions (3)

  • Theorem 2.1: Exponent Entropy Concentration
  • proof
  • Corollary 2.2: Compression Limit