eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization
Aditya Agrawal, Matthew Hedlund, Blake Hechtman
TL;DR
eXmY introduces a general-purpose data type for arbitrary bit-width quantization, defined by 1 sign bit, X exponent bits, and Y mantissa bits, yielding a flexible bit width B = 1 + X + Y that encompasses integers and floating-point formats with subnormals and configurable exponent bias. It provides emulation, encoding/decoding codecs, and a bit-packing scheme that achieves perfect compression via power-of-2 decomposition, enabling byte-addressable storage and seamless sharding across tensors and tiles. The approach leverages an exponent-distribution observation in ML models to reduce exponent bits while maintaining accuracy, and demonstrates that formats like e3m1 can preserve quality across many datasets, including PaLM-2 S, with per-row metadata facilitating quality retention at low bit-widths. Practically, eXmY reduces memory, bandwidth, and storage demands, supports PTQ and QAT, and has been deployed in production, offering a path to flexible, hardware-agnostic quantization that adapts to model and deployment constraints. The modular design, comprising a data type, emulation, codecs, and a distribution-based quantization technique, paves the way for scalable, multi-tenant inference and training workflows with arbitrary precision.
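To make the B = 1 + X + Y definition concrete, the following is a minimal sketch of how a single eXmY code could be decoded to a real value. It is an illustration under stated assumptions, not the authors' emulation library: the function names `formats_for_width` and `decode_exmy` are hypothetical, the bit layout (sign in the most significant bit, then exponent, then mantissa) is assumed, the default bias 2^(X-1) - 1 is assumed (the text only says the bias is configurable), and no codes are reserved for infinities or NaNs.

```python
from typing import List, Optional


def formats_for_width(b: int) -> List[str]:
    """List every eXmY format with total width b = 1 + X + Y."""
    return [f"e{x}m{b - 1 - x}" for x in range(b)]


def decode_exmy(code: int, x: int, y: int, bias: Optional[int] = None) -> float:
    """Decode a (1 + x + y)-bit code to a float under the assumed layout."""
    if bias is None:
        # Assumed default (IEEE-style); the source only states that the
        # bias is configurable.
        bias = (1 << (x - 1)) - 1 if x > 0 else 0
    width = 1 + x + y
    if not 0 <= code < (1 << width):
        raise ValueError(f"code does not fit in {width} bits")

    # Assumed layout: [sign | exponent (x bits) | mantissa (y bits)].
    sign = -1.0 if (code >> (x + y)) & 1 else 1.0
    exp_field = (code >> y) & ((1 << x) - 1)
    man_field = code & ((1 << y) - 1)
    fraction = man_field / (1 << y)

    if exp_field == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        # With x == 0 every code takes this branch, so the values form an
        # evenly spaced grid.
        return sign * fraction * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + fraction) * 2.0 ** (exp_field - bias)


if __name__ == "__main__":
    print(formats_for_width(5))  # the 5-bit family, which includes e3m1
    # All non-negative e3m1 values under the assumptions above.
    print(sorted({decode_exmy(c, 3, 1) for c in range(16)}))
```

Under these assumptions, the e3m1 code 0b00111 decodes to 1.5 and the smallest positive subnormal, 0b00001, decodes to 0.125. Passing a different `bias` shifts the whole representable range by a power of two without changing the bit width.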
Abstract
eXmY is a novel data type for quantization of ML models. It supports both arbitrary bit widths and arbitrary integer and floating-point formats. For example, it seamlessly supports 3-, 5-, 6-, 7-, and 9-bit formats. For a specific bit width, say 7, it defines all possible formats, e.g., e0m6, e1m5, e2m4, e3m3, e4m2, e5m1, and e6m0. For non-power-of-two bit widths, e.g., 5, 6, and 7, we created a novel encoding and decoding scheme that achieves perfect compression and byte addressability and is amenable to sharding and vector processing. We implemented libraries for emulation, encoding, and decoding of tensors and checkpoints in C++, TensorFlow, JAX, and PAX. For optimal performance, the codecs use SIMD instructions on CPUs and vector instructions on TPUs and GPUs. eXmY is also a technique that exploits the statistical distribution of exponents in tensors. It can be used to quantize weights, static and dynamic activations, gradients, master weights, and optimizer state. It can reduce memory (CPU DRAM and accelerator HBM), network, and disk storage and transfers. It can increase multi-tenancy and accelerate compute. eXmY has been deployed in production for almost 2 years.
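One way to read the encoding scheme for non-power-of-two widths is sketched below: each code is split into power-of-two-sized pieces (a 7-bit code into 4 + 2 + 1 bits), and each piece is packed into its own byte-aligned plane. This is a hedged illustration, not the authors' codec. The function names `power_of_two_parts`, `pack`, and `unpack`, the piece ordering, and the restriction to groups of 8 codes and widths up to 15 bits are all assumptions made for the example.

```python
from typing import List


def power_of_two_parts(width: int) -> List[int]:
    """Split a bit width into descending powers of two, e.g. 7 -> [4, 2, 1]."""
    return [1 << i for i in range(width.bit_length() - 1, -1, -1)
            if width & (1 << i)]


def pack(codes: List[int], width: int) -> List[bytes]:
    """Pack `width`-bit codes into one byte-aligned plane per power-of-two part."""
    if not 1 <= width <= 15:
        raise ValueError("this sketch handles widths from 1 to 15 bits")
    if len(codes) % 8:
        raise ValueError("this sketch requires a multiple of 8 codes")
    planes = []
    shift = width
    for part in power_of_two_parts(width):
        shift -= part                  # bit position of this piece in a code
        mask = (1 << part) - 1
        per_byte = 8 // part           # pieces of this size that fit in a byte
        plane = bytearray()
        for i in range(0, len(codes), per_byte):
            byte = 0
            for j, code in enumerate(codes[i:i + per_byte]):
                byte |= ((code >> shift) & mask) << (j * part)
            plane.append(byte)
        planes.append(bytes(plane))
    return planes


def unpack(planes: List[bytes], width: int, count: int) -> List[int]:
    """Inverse of `pack`: reassemble `count` codes from the planes."""
    codes = [0] * count
    shift = width
    for part, plane in zip(power_of_two_parts(width), planes):
        shift -= part
        mask = (1 << part) - 1
        per_byte = 8 // part
        for i in range(count):
            piece = (plane[i // per_byte] >> ((i % per_byte) * part)) & mask
            codes[i] |= piece << shift
    return codes


if __name__ == "__main__":
    codes = [(i * 37) % 128 for i in range(16)]   # sixteen 7-bit codes
    planes = pack(codes, 7)
    print([len(p) for p in planes])               # bytes per plane
    print(unpack(planes, 7, len(codes)) == codes)
```

In this sketch, sixteen 7-bit codes occupy 8 + 4 + 2 = 14 bytes, which is exactly 16 × 7 bits with no padding, and every plane starts and ends on a byte boundary. Because each group of 8 codes packs independently, a tensor can be split at any multiple of 8 elements without bits straddling a shard boundary, which is one plausible reason such a layout suits sharding and vector processing.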
