eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization

Aditya Agrawal, Matthew Hedlund, Blake Hechtman

TL;DR

eXmY is a general-purpose data type for arbitrary bit-width quantization. Each value has 1 sign bit, X exponent bits, and Y mantissa bits, for a total width of B = 1 + X + Y bits; the family covers integer formats as well as floating-point formats with subnormals and a configurable exponent bias. The work provides emulation, encoding and decoding codecs, and a bit-packing scheme that achieves perfect compression by decomposing the bit width into powers of two, which keeps packed data byte-addressable and easy to shard across tensors and tiles. The technique exploits the observed distribution of exponents in ML models to reduce the number of exponent bits while maintaining accuracy, and it reports that formats such as e3m1 can preserve quality across many datasets, including for PaLM-2 S, with per-row metadata helping to retain quality at low bit widths. In practice, eXmY reduces memory, bandwidth, and storage demands, supports PTQ and QAT, and has been deployed in production. Its modular design (a data type, emulation, codecs, and a distribution-based quantization technique) offers a flexible, hardware-agnostic path to quantization that adapts to model and deployment constraints, for both inference and training.
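To make the B = 1 + X + Y layout concrete, here is a minimal decoding sketch in Python. It is not the authors' codec. It assumes an IEEE-style interpretation: bits are ordered sign, exponent, mantissa from most to least significant; an exponent field of 0 denotes a subnormal; and no codes are reserved for infinities or NaN. The function name `decode_exmy` and its parameters are hypothetical.

```python
def decode_exmy(code: int, x: int, y: int, bias: int) -> float:
    """Decode one (1 + x + y)-bit code under the assumptions stated above.

    Assumed layout (not confirmed by the source): sign | exponent | mantissa.
    """
    width = 1 + x + y
    if not 0 <= code < (1 << width):
        raise ValueError(f"code does not fit in {width} bits")

    sign = -1.0 if (code >> (x + y)) & 1 else 1.0
    exp_field = (code >> y) & ((1 << x) - 1)   # always 0 when x == 0
    man_field = code & ((1 << y) - 1)          # always 0 when y == 0
    frac = man_field / (1 << y)

    if exp_field == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * frac * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + frac) * 2.0 ** (exp_field - bias)


# The eight non-negative e2m1 values with an assumed bias of 1.
positive_e2m1 = [decode_exmy(c, x=2, y=1, bias=1) for c in range(8)]
print(positive_e2m1)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Under these assumptions, a format with X = 0 has only subnormal codes, so its values are evenly spaced like a sign-magnitude integer grid. Changing `bias` shifts the whole representable range by a power of two.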

Abstract

eXmY is a novel data type for quantization of ML models. It supports both arbitrary bit widths and arbitrary integer and floating-point formats. For example, it seamlessly supports 3-, 5-, 6-, 7-, and 9-bit formats. For a specific bit width, say 7, it defines all possible formats, namely e0m6, e1m5, e2m4, e3m3, e4m2, e5m1, and e6m0. For non-power-of-two bit widths, e.g., 5, 6, and 7, we created a novel encoding and decoding scheme that achieves perfect compression and byte addressability and is amenable to sharding and vector processing. We implemented libraries for emulation and for encoding and decoding tensors and checkpoints in C++, TensorFlow, JAX, and PAX. For optimal performance, the codecs use SIMD instructions on CPUs and vector instructions on TPUs and GPUs. eXmY is also a technique that exploits the statistical distribution of exponents in tensors. It can be used to quantize weights, static and dynamic activations, gradients, master weights, and optimizer state. It can reduce memory usage (CPU DRAM and accelerator HBM) as well as network and disk storage and transfers. It can increase multi-tenancy and accelerate compute. eXmY has been deployed in production for almost 2 years.
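The format enumeration and the packing idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's codec. It assumes that a B-bit code is split into power-of-two slices (7 = 4 + 2 + 1), that each slice is stored in its own byte-aligned plane, that the element count is a multiple of 8, and that B is at most 15. The byte layout and the names `formats_for_width`, `power_of_two_parts`, `pack`, and `unpack` are hypothetical.

```python
def formats_for_width(width: int) -> list:
    """All eXmY formats with 1 sign bit and the given total width."""
    return [f"e{x}m{width - 1 - x}" for x in range(width)]


def power_of_two_parts(width: int) -> list:
    """Split a bit width into descending powers of two, e.g. 7 -> [4, 2, 1]."""
    return [1 << i for i in reversed(range(width.bit_length())) if (width >> i) & 1]


def pack(codes: list, width: int) -> bytes:
    """Pack `width`-bit codes into one byte-aligned plane per power-of-two slice."""
    if not 1 <= width <= 15 or len(codes) % 8:
        raise ValueError("sketch assumes width in 1..15 and a multiple of 8 codes")
    out = bytearray()
    shift = width
    for part in power_of_two_parts(width):
        shift -= part                      # bit position of this slice in a code
        per_byte = 8 // part               # slices of this size that fit in a byte
        mask = (1 << part) - 1
        for start in range(0, len(codes), per_byte):
            byte = 0
            for k, code in enumerate(codes[start:start + per_byte]):
                byte |= ((code >> shift) & mask) << (k * part)
            out.append(byte)
    return bytes(out)


def unpack(data: bytes, width: int, count: int) -> list:
    """Inverse of `pack`: reassemble `count` codes from the slice planes."""
    codes = [0] * count
    offset = 0
    shift = width
    for part in power_of_two_parts(width):
        shift -= part
        per_byte = 8 // part
        mask = (1 << part) - 1
        for i in range(count):
            byte = data[offset + i // per_byte]
            codes[i] |= ((byte >> ((i % per_byte) * part)) & mask) << shift
        offset += count // per_byte        # next plane starts on a byte boundary
    return codes


codes = [(37 * i + 11) % 128 for i in range(16)]   # sixteen arbitrary 7-bit codes
packed = pack(codes, width=7)
assert len(packed) == 16 * 7 // 8                  # no padding bits are stored
assert unpack(packed, width=7, count=16) == codes
```

In this layout every plane holds only power-of-two-sized slices, so no slice straddles a byte and the packed size is exactly N·B/8 bytes for N elements. That is one way to read "perfect compression" together with byte addressability.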

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Emulation using e2m1 with different schemes.
  • Figure 2: Bit packing and unpacking for 7-bit wide elements.
  • Figure 3: Histogram of the exponent values.