To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava
TL;DR
This work investigates how exponent concentration in GenAI weights enables lossless compression within low-precision FP formats. By showing that exponents follow $α$-stable distributions under stochastic gradient dynamics, the authors derive finite entropy bounds and a theoretical compression limit near FP4.67, motivating a practical FP8 approach. They introduce Exponent-Concentrated FP8 (ECF8), a lossless FP8 compression framework with Huffman-based exponent coding, GPU-optimized decoding, and just-in-time decompression, achieving up to 26.9% memory savings and up to 177.1% throughput gains on models up to 671B parameters while preserving bit-exact outputs. The results establish exponent concentration as a statistical law of trained models and provide a principled pathway for designing next-generation low-precision floating-point formats for GenAI.
Abstract
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $α$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and up to 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
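
To make the notion of low exponent entropy concrete, the following is a minimal Python sketch, not the ECF8 implementation. It rests on several assumptions that do not come from the paper: the weights are synthetic Gaussian samples (a stand-in for trained weights, not the $α$-stable model analyzed in the paper), exponents are read from Python floats with `math.frexp` rather than from FP8 bit fields, and the helper names `exponent_of`, `shannon_entropy`, and `huffman_code_lengths` are hypothetical. It measures the Shannon entropy of the exponent histogram and the average length of a Huffman code built over it.

```python
import heapq
import math
import random
from collections import Counter


def exponent_of(x: float) -> int:
    """Unbiased base-2 exponent e with |x| = m * 2**e and 1 <= m < 2 (x != 0)."""
    _, e = math.frexp(x)  # frexp yields a mantissa in [0.5, 1), hence the shift
    return e - 1


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the empirical distribution given by the counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts: Counter) -> dict:
    """Map each symbol to its Huffman code length in bits."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Synthetic stand-in for a weight tensor (assumption: Gaussian, std 0.02).
random.seed(0)
weights = [w for w in (random.gauss(0.0, 0.02) for _ in range(100_000)) if w != 0.0]

counts = Counter(exponent_of(w) for w in weights)
entropy = shannon_entropy(counts)
lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / sum(counts.values())

print(f"distinct exponents: {len(counts)}")
print(f"exponent entropy:   {entropy:.3f} bits")
print(f"Huffman average:    {avg_bits:.3f} bits per exponent")
```

Because a Huffman code's average length lies within one bit of the entropy, a concentrated exponent histogram translates directly into fewer stored bits per exponent than a fixed-width field (for example, the 4-bit exponent of the FP8 E4M3 format), while the sign and mantissa bits are left untouched so the original values can be recovered exactly.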
