Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, Chao Fang, Marian Verhelst

TL;DR

The paper tackles the challenge of building an edge NPU capable of both training and inference with precision-scalable Microscaling (MX) data types. It introduces a hybrid precision-scalable reduction tree that blends integer accumulation with FP-style normalization, enabling efficient mixed-precision accumulation with a controlled relaxation of accuracy. The MX MACs are organized into an 8×8 MX tensor core and integrated into the SNAX NPU platform, with CSR-based control and dynamic data streaming that adapts bandwidth to the current MX precision. Experimental results show energy-efficiency gains over the state-of-the-art PS-MX_MAC across the MXINT8, MXFP8/6, and MXFP4 modes, along with high utilization on ResNet18 and Vision Transformer workloads, which makes the design relevant for continual-learning edge AI.

Abstract

Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8×8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design at both the MAC and system levels and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W for MXINT8, MXFP8/6, and MXFP4, respectively, with a throughput of 64, 256, and 512 GOPS.
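
To make the integer-accumulation side of the trade-off above concrete, the following is a minimal sketch of a block dot product with a shared power-of-two scale per block and int8 elements. It is an illustration under stated assumptions, not the paper's datapath: the block size of 32, the function names `mx_quantize_int8` and `mx_dot`, and the simplified scale-selection rule are all assumptions introduced here.

```python
import math

BLOCK = 32  # assumed block size; passed-in lists may be any length in this sketch


def mx_quantize_int8(values):
    """Quantize one block to (shared exponent, int8 elements).

    Simplified rule (an assumption, not the exact MX scale selection):
    choose the smallest power-of-two scale such that the largest
    magnitude in the block still fits in [-127, 127].
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    exp = math.ceil(math.log2(amax / 127))
    scale = 2.0 ** exp
    # Clamp guards against floating-point edge cases in log2.
    elems = [max(-127, min(127, round(v / scale))) for v in values]
    return exp, elems


def mx_dot(block_a, block_b):
    """Dot product of two blocks using exact integer accumulation.

    The element products are summed as integers; the two shared
    exponents are applied once to the final sum.
    """
    exp_a, a = mx_quantize_int8(block_a)
    exp_b, b = mx_quantize_int8(block_b)
    acc = sum(x * y for x, y in zip(a, b))  # integer accumulation, no rounding
    return acc * 2.0 ** (exp_a + exp_b)


if __name__ == "__main__":
    print(mx_dot([1.0] * BLOCK, [2.0] * BLOCK))
```

With integer elements the accumulation is exact and the shared scales are applied only once per block. For narrow floating-point elements, each product would first have to be converted to a common fixed-point format, which is the conversion cost the abstract refers to.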

Paper Structure

This paper contains 18 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Resource breakdown of the state-of-the-art precision-scalable Microscaling (MX) multiply-accumulate (MAC) unit PS-MX_MAC, where more than $80\%$ of the resources go to the reduction tree.
  • Figure 2: Overview and issues of the state-of-the-art reduction trees for MX MAC implementations: FP32 addition (PS-MX_MAC) and long integer addition (MXDotP), followed by the solutions proposed in this work.
  • Figure 3: First (a) and second (b) iteration of our proposed hybrid reduction tree architecture with examples (c, d) to illustrate the multiplexer in (b).
  • Figure 4: Comparison of quantization error and addition error for reduced mantissa lengths in accumulation. Both errors are normalized to the result computed in float64, which is also treated as the exact result when computing the errors. (Left) Error comparison for MXFP8 E4M3 with 64x64 matrices and Gaussian-distributed matrix elements. (Right) The lowest mantissa lengths for which the quantization error is larger than the addition error, for both 64x64 and 256x256 matrices and for uniform and Gaussian distributions of the matrix elements.
  • Figure 5: System architecture overview with precision-scalable MX tensor core integration.
  • ...and 3 more figures
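
The comparison described in the Figure 4 caption can be approximated with a small experiment. The sketch below is a rough reconstruction under assumptions, not the authors' evaluation code: the helper names (`round_mantissa`, `dot_reduced`, `error_sweep`) are hypothetical, element quantization only limits the mantissa (the exponent range is left unbounded), the accumulator is rounded after every addition, and both errors are normalized to the float64 result as the caption states.

```python
import math
import random


def round_mantissa(x, bits):
    """Round x to `bits` explicit mantissa bits (exponent range unbounded)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    step = 2.0 ** (bits + 1)    # +1 for the implicit leading bit
    return math.ldexp(round(m * step) / step, e)


def dot_reduced(a, b, acc_bits):
    """Sequential dot product with the accumulator rounded after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_mantissa(acc + x * y, acc_bits)
    return acc


def error_sweep(a, b, elem_bits, acc_bits_list):
    """Return (quantization error, {acc_bits: addition error}).

    Both errors are relative to the float64 dot product of the
    unquantized inputs, which is treated as the exact result.
    """
    exact = math.fsum(x * y for x, y in zip(a, b))
    aq = [round_mantissa(x, elem_bits) for x in a]
    bq = [round_mantissa(y, elem_bits) for y in b]
    exact_q = math.fsum(x * y for x, y in zip(aq, bq))
    quant_err = abs(exact_q - exact) / abs(exact)
    add_err = {
        m: abs(dot_reduced(aq, bq, m) - exact_q) / abs(exact)
        for m in acc_bits_list
    }
    return quant_err, add_err


if __name__ == "__main__":
    random.seed(0)
    a = [random.gauss(0.0, 1.0) for _ in range(64)]
    b = [random.gauss(0.0, 1.0) for _ in range(64)]
    # 3 mantissa bits mirrors the M3 in E4M3; accumulator widths are arbitrary.
    q, add = error_sweep(a, b, elem_bits=3, acc_bits_list=[4, 8, 12, 16, 23])
    print("quantization error:", q)
    for m, err in add.items():
        print(f"accumulator mantissa {m:2d} bits -> addition error {err}")
```

The point of such a sweep is to find the accumulator mantissa length below which the addition error starts to exceed the quantization error already present in the inputs. Above that length, extra accumulator bits add cost without improving the end result.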