Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models

Yun-Chen Lo, Gu-Yeon Wei, David Brooks

TL;DR

This work tackles the memory bottlenecks of ever-larger LLMs by advancing direct-cast compression through Nanoscaling Floating-Point (NxFP), which integrates NanoMantissa, Adaptive Microexponent, and Code Recycling to outperform Microscaling (MxFP) at ultra-low bitwidths. NxFP improves quantization accuracy (MSE reductions of up to about 23% from NanoMantissa alone, and perplexity improvements of up to 0.64 points at inference) and reduces memory footprint (by up to 16%) while maintaining or improving reasoning accuracy (MMLU gains of up to 30.2%) across multiple modern LLMs. A practical on-the-fly dequantization pathway enables deployment on off-the-shelf hardware, so direct-cast compression can be used for inference without new hardware. Overall, NxFP delivers better perplexity, accuracy, and footprint trade-offs than MxFP for sub-6-bit quantization.

Abstract

As cutting-edge large language models (LLMs) continue to transform various industries, their fast-growing model size and sequence length have led to memory traffic and capacity challenges. Recently, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm have proposed a Microscaling standard (Mx), which augments block floating-point with microexponents to achieve promising perplexity-to-footprint trade-offs. However, Microscaling suffers from significant perplexity degradation on modern LLMs at fewer than six bits. This paper profiles modern LLMs and identifies three main challenges of the low-bit Microscaling format: inaccurate tracking of outliers, vacant quantization levels, and wasted binary codes. In response, Nanoscaling (NxFP) proposes three techniques, NanoMantissa, Adaptive Microexponent, and Code Recycling, to enable better accuracy and a smaller memory footprint than state-of-the-art MxFP. Experimental results on direct-cast inference across various modern LLMs demonstrate that the proposed methods outperform state-of-the-art MxFP by up to 0.64 in perplexity and by up to 30% in accuracy on MMLU benchmarks. Furthermore, NxFP reduces memory footprint by up to 16% while achieving perplexity comparable to MxFP.

Paper Structure (19 sections, 12 figures, 1 table, 1 algorithm)

Figures (12)

  • Figure 1: (a) Block FP, (b) Microscaling FP, and (c) our Nanoscaling FP (NxFP). NxFP proposes NanoMantissa, Adaptive Microexponent, and Code Recycling to outperform the MxFP standard.
  • Figure 2: Visualizing quantization of a real FP16 vector using MxFP. The shared exponent tracks the largest exponent value in each block, and the microexponents track the exponent offset relative to the shared exponent.
  • Figure 3: Profiling weights scaled by $E_{shared}$ (block size = 32) on five modern LLMs. We show three challenges of the MxFP4 format: vacant quantization levels, wasted codes, and inefficient tracking of outliers.
  • Figure 4: (a) MxFP4 and (b) MxFP4 with NanoMantissa. The proposed NanoMantissa gives MxFP4 greater precision so that it can track the largest value more accurately.
  • Figure 5: (a) Inter-block distribution heterogeneity: different vectors have distinct distributions, which motivates giving each block its own optimized format, e.g., MxFP4 or BFP4. (b) We propose to use an index bit to indicate whether a target block uses MxFP or BFP. (c) Logically, we fuse the two formats into one, which adapts to different scenarios.
  • ...and 7 more figures
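
The captions for Figures 2 and 5 describe two mechanisms that a small example may make more concrete: a per-block shared exponent set by the largest magnitude, and a per-block index bit that picks between an MxFP4 grid and a BFP4 grid. The following Python sketch illustrates both under stated assumptions and is not the authors' implementation. The MxFP4 magnitudes are the FP4 E2M1 values; the BFP4 grid (a uniform 3-bit magnitude), the minimum-MSE selection rule, the simple nearest-level rounding (ties go to the smaller magnitude), and all function names are hypothetical choices made for illustration.

```python
import math

# Magnitude grids for a 4-bit element (1 sign bit + 3 magnitude bits).
MXFP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 magnitudes
BFP4_LEVELS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # assumed uniform grid
EMAX = 2  # both grids have their largest values in the binade [4, 8)


def shared_exponent(block):
    """Shared power-of-two exponent derived from the largest magnitude."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0
    _, e = math.frexp(max_abs)  # max_abs = m * 2**e with 0.5 <= m < 1
    return (e - 1) - EMAX       # floor(log2(max_abs)) - EMAX


def quantize_block(block, levels):
    """Direct-cast a block onto `levels`; returns (shared exponent, dequantized values)."""
    e_shared = shared_exponent(block)
    scale = 2.0 ** e_shared
    dequantized = []
    for v in block:
        # Nearest grid level; magnitudes above the top level saturate.
        nearest = min(levels, key=lambda q: abs(q - abs(v) / scale))
        dequantized.append(math.copysign(nearest, v) * scale)
    return e_shared, dequantized


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def choose_format(block):
    """Hypothetical rule: index bit 0 selects the MxFP4 grid, 1 the BFP4 grid."""
    candidates = [quantize_block(block, MXFP4_LEVELS),
                  quantize_block(block, BFP4_LEVELS)]
    errors = [mse(block, deq) for _, deq in candidates]
    index_bit = 0 if errors[0] <= errors[1] else 1
    return index_bit, candidates[index_bit][1], errors


if __name__ == "__main__":
    # One outlier plus small values: the non-uniform MxFP4 grid fits better.
    print(choose_format([0.1, -0.75, 3.0, 0.02])[0])
    # Values clustered near the maximum: the uniform grid fits better.
    print(choose_format([5.0, 7.0, -5.0, 7.0])[0])
```

In this sketch a block dominated by a single outlier selects the MxFP4 grid, whose levels are denser near zero, while a block whose values cluster near the maximum selects the uniform grid. This mirrors the inter-block heterogeneity that Figure 5 uses to motivate a per-block format choice.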