Microscaling Floating Point Formats for Large Language Models

Marco Cococcioni, Dario Pagani, Federico Rossi

TL;DR

This work tackles the resource bottlenecks of large language models by adopting Microscaling, a block-based 8-bit floating-point approach that shares a single scale per block to extend dynamic range. It delivers a flexible C++23 implementation with a generic data-format interface, including an exact-accumulator option and LUT-backed arithmetic, enabling both training and inference under mixed-precision regimes. The approach is validated via GPT-2 experiments, showing that Microscaling can maintain competitive accuracy while reducing memory and compute, albeit with careful attention to rounding, operation ordering, and softmax stability. With future hardware support for low-bit formats, Microscaling has the potential to substantially accelerate LLM training and deployment at scale.

Abstract

The increasing computational and memory demands of large language models (LLMs) necessitate innovative approaches to optimize resource usage without compromising performance. This paper leverages microscaling floating-point formats, a technique designed to address these challenges by reducing the storage and computational overhead associated with numerical representations in LLMs. Unlike traditional floating-point representations, in which every value carries its own dedicated scale, microscaling shares a single scale across a block of values, enabling compact one-byte floating-point representations while maintaining an extended dynamic range. We explore the application of microscaling to 8-bit floating-point formats to significantly reduce memory footprint and computational costs. We tested several configurations of microscaling floats within the GPT-2 LLM architecture, demonstrating that microscaling data formats can achieve competitive accuracy during training and inference, which supports their use as a resource-efficient alternative for deploying LLMs at scale. The source code is publicly available at: https://github.com/unipi-dii-compressedarith/llm.c-sve
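To make the shared-scale idea concrete, the following is a minimal C++ sketch, not the paper's implementation. It assumes an E4M3-like element grid (3 mantissa bits, maximum magnitude 448, minimum normal exponent -6) and a power-of-two shared scale chosen with one common convention (exponent of the largest magnitude in the block minus the element format's maximum exponent, 8 here). The names `MxBlock`, `round_to_e4m3_grid`, `quantize_block`, and `dequantize_block` are hypothetical, and elements are kept as `float` for readability rather than packed into one byte.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Round a finite value to the nearest point of a simplified E4M3-like grid.
// Assumption: 3 mantissa bits, max magnitude 448, min normal exponent -6.
inline float round_to_e4m3_grid(float x) {
    const float max_mag = 448.0f;
    const float a = std::min(std::fabs(x), max_mag);  // saturate on overflow
    int e = 0;
    std::frexp(a, &e);                           // a = m * 2^e, m in [0.5, 1)
    e = std::max(e - 1, -6);                     // leading-bit exponent, clamped
    const float step = std::ldexp(1.0f, e - 3);  // grid spacing in this binade
    const float q = std::min(std::nearbyint(a / step) * step, max_mag);
    return std::copysign(q, x);
}

// One block: a single shared power-of-two scale plus the element values.
struct MxBlock {
    int shared_exp = 0;        // shared scale is 2^shared_exp
    std::vector<float> elems;  // values on the element grid (1 byte each if packed)
};

// Quantize a block of finite values with a shared scale.
inline MxBlock quantize_block(const std::vector<float>& v) {
    assert(!v.empty());
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));

    MxBlock b;
    b.elems.assign(v.size(), 0.0f);
    if (max_abs == 0.0f) return b;  // all-zero block: keep scale 2^0

    // Assumed convention: floor(log2(max_abs)) minus the element emax (8).
    b.shared_exp = std::ilogb(max_abs) - 8;
    for (std::size_t i = 0; i < v.size(); ++i) {
        b.elems[i] = round_to_e4m3_grid(std::ldexp(v[i], -b.shared_exp));
    }
    return b;
}

// Reconstruct approximate values: element * 2^shared_exp.
inline std::vector<float> dequantize_block(const MxBlock& b) {
    std::vector<float> out(b.elems.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = std::ldexp(b.elems[i], b.shared_exp);
    }
    return out;
}
```

In this sketch the per-block cost is one shared exponent plus one small element per value, which is where the memory saving comes from; the largest value in the block determines the scale, so much smaller values in the same block lose relative precision.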

Paper Structure

This paper contains 39 sections, 11 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison between some floating point formats
  • Figure 2: Anyfloat makes use of a support data structure, called Unpacked, to represent the unpacked number
  • Figure 3: Example of an MX vector with block size 3 and an iterator instance with the auto-commit feature enabled
  • Figure 4: Learning with mixed precision and with a master copy of the weights (figure inspired by the one provided in Kalamkar et al., 2019).
  • Figure 5: Comparison of the loss function with two different rounding policies
  • ...and 4 more figures