A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats

Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao

TL;DR

This work addresses the memory and compute challenges of large language model (LLM) inference by adopting Microscaling (MX) formats and introducing a dataflow compiler, MASE, that automatically searches for mixed-precision MXInt quantization tailored to each LLM. The core contributions include the MASE Intermediate Representation (MASE IR) for hardware-aware, trainable co-design; an orchestration framework for plugging in custom MX formats via software emulators and hardware templates; and an open-source MX operator library with source-level hardware evaluation. Empirical results across multiple LLM families show that mixed-precision (MP) MXInt reaches an average precision of around 4 bits with minimal accuracy loss and an average improvement of about 24% in Δ accuracy, while incurring only ~3% overhead in energy efficiency relative to 8-bit fixed-point baselines. MXInt generally offers better area efficiency and accuracy than fixed-point and uniform MX formats. The work demonstrates a practical pathway to ASIC-like efficiency for LLM inference through model-specific mixed-precision quantization and dataflow-based mapping, and it lays the groundwork for broader MX-format exploration and hardware specialization in future accelerators.

Abstract

Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. Recent large language models (LLMs) are harder to quantize to a competitive memory density than other models such as convolutional neural networks, because values in LLMs require larger dynamic ranges. Current hardware can accelerate LLM computation using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can improve accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats, which achieve both large dynamic ranges and high memory density. We propose a compiler named MASE for exploring mixed-precision MX formats on dataflow hardware accelerators for LLM inference. Our main contributions are twofold. First, we propose a novel orchestration abstraction to explore both software and hardware optimizations with new data formats. Second, MASE achieves LLM inference at an average precision of 4 bits, with minimal to no accuracy degradation. To our knowledge, MASE represents the first effort to harness fine-grained multi-precision MX formats in the design of LLM hardware accelerators. Over a range of LLMs and datasets, MASE achieves an average improvement of 24% in $\Delta$ accuracy with an overhead of only 3% in energy efficiency compared to designs using 8-bit fixed-point numbers.
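To make the idea behind MX formats concrete, the following is a minimal NumPy sketch of MXInt-style block quantization: each block of values shares one power-of-two exponent, and each element keeps a small signed integer mantissa. The function names (`mxint_quantize`, `mxint_dequantize`), the choice of exponent, and the rounding and clipping rules are assumptions made for illustration; they are not taken from the MASE source code or from any hardware template described in the paper.

```python
import numpy as np

def mxint_quantize(x, block_size=32, mantissa_bits=8):
    """Illustrative MXInt-style quantizer (hypothetical helper, not the MASE API).

    Returns per-element signed integer mantissas of shape (n_blocks, block_size)
    and one shared exponent per block of shape (n_blocks, 1).
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size % block_size != 0:
        raise ValueError("input length must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)

    # Largest magnitude in each block decides the shared exponent.
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)  # avoid log of zero for all-zero blocks
    # frexp gives safe = m * 2**e with m in [0.5, 1), so floor(log2(safe)) = e - 1.
    _, e = np.frexp(safe)
    exp = e.astype(np.int64) - 1

    # Assumed scaling: the largest element uses (almost) the full mantissa range.
    scale = 2.0 ** (exp - (mantissa_bits - 2))
    qmin = -(2 ** (mantissa_bits - 1))
    qmax = 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.rint(blocks / scale), qmin, qmax).astype(np.int64)
    return mant, exp

def mxint_dequantize(mant, exp, mantissa_bits=8):
    """Reconstruct floating-point values from mantissas and shared exponents."""
    scale = 2.0 ** (exp - (mantissa_bits - 2))
    return (mant * scale).reshape(-1)

# Two blocks with very different magnitudes, mimicking a wide dynamic range.
rng = np.random.default_rng(0)
x = rng.normal(size=64) * np.repeat([1e-3, 1e3], 32)
mant, exp = mxint_quantize(x)
x_hat = mxint_dequantize(mant, exp)
print("shared exponents:", exp.ravel())
print("max abs error per block:", np.abs(x - x_hat).reshape(2, 32).max(axis=1))
```

Because the exponent is chosen per block, the small-magnitude block and the large-magnitude block are each quantized relative to their own range, which is the property that a single fixed-point scale for the whole tensor would lack.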

Paper Structure (20 sections, 4 equations, 8 figures, 4 tables)

This paper contains 20 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An example of mapping LLaMA onto a dataflow accelerator. The large variances of each activation across different layers in (a) motivate us to use quantization with MX formats in (c). We achieve mixed-precision quantization in (b) and map the model onto a dataflow architecture in (d). The dataflow schedule exploits task-level parallelism in (f), leading to a higher throughput compared to a non-dataflow schedule in (e). The proposed MASE compiler provides a fully automated and efficient approach to exploring software and hardware optimizations for MX formats.
  • Figure 2: A toy model in MASE IR after quantization and hardware parallelism.
  • Figure 3: Orchestration of existing tools for exploring new data formats. Given both software and hardware specifications for a new data format, MASE automatically performs a resource-constrained quantization search for a given LLM.
  • Figure 4: Evaluation of search algorithms for OPT125M on sst2. MASE orchestrates existing search algorithms to explore resource-constrained quantization with mixed-precision MXInt formats. The cost function is shown in the y-axis label (described in Section \ref{sec:method:mxint_search}), where $acc$ is the accuracy, $b$ is the average bitwidth, and $k$ is a hyperparameter that normalizes the costs. We observed that TPE is the most efficient search algorithm for MXInt quantization.
  • Figure 5: Evaluation of three MX data formats for quantizing LLMs on sst2. The area efficiency results are plotted relative to the int8 results (higher is better). Accuracy is reported as the difference from the accuracy obtained with FP32 (higher is better). To ensure fairness, all formats use a block size of 32 with an 8-bit shared component and 8-bit local components, leading to an average bitwidth of about 8 bits. Overall, MXInt shows both high area efficiency and high accuracy for LLM quantization.
  • ...and 3 more figures
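
The average bitwidth quoted in the Figure 5 caption can be checked with a short calculation. The sketch below assumes that the shared component is stored once per block and amortised over all elements in that block; the helper name `average_bitwidth` is hypothetical and used only for illustration.

```python
def average_bitwidth(block_size, shared_bits, element_bits):
    """Bits per element when one shared component is amortised over a block."""
    return (shared_bits + block_size * element_bits) / block_size

# Figure 5 configuration: block size 32, 8-bit shared component, 8-bit local components.
print(average_bitwidth(32, 8, 8))  # 8.25
```

Under this assumption the shared component adds 8/32 = 0.25 bits per element, so the 8-bit figure in the caption is the element width and the amortised cost is slightly above it.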