Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction

Liang Zhao, Kunming Shao, Zhipeng Liao, Xijie Huang, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Yi Zou

TL;DR

This work tackles efficient FP8 inference/training on digital compute-in-memory by enabling variable aligned-mantissa bitwidths across FP8 formats. It introduces a software-hardware co-design comprising Dynamic Shift-aware Bitwidth Prediction (DSBP), a Mantissa Prediction Unit (MPU), a FIFO-based Input Alignment Unit (FIAU), and a precision-scalable INT MAC array, all implemented in 28nm with a 64×96 CIM array. The architecture demonstrates 20.4 TFLOPS/W for fixed $E5M7$ and up to 2.8× higher FP8 efficiency than prior FP-CIM work while supporting all FP8 formats from $E2M5$ to $E5M2$, validated on Llama-7b (BoolQ, Winogrande) and ResNet18/ImageNet. Results show that on-the-fly DSBP maintains accuracy while delivering flexible accuracy–efficiency trade-offs, highlighting the benefits of software-hardware co-design for variable-mantissa FP8 computation in digital CIM.

Abstract

FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, demonstrating 2.8$\times$ higher FP8 efficiency than previous work while supporting all FP8 formats. Results on Llama-7b show that the DSBP achieves higher efficiency than fixed bitwidth mode at the same accuracy level on both BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
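To make the accuracy-efficiency trade-off concrete, here is a minimal Python sketch of exponent alignment with a configurable mantissa bitwidth. It is not the paper's DSBP algorithm or hardware. The function names `align_mantissas` and `aligned_dot`, the truncate-toward-zero policy, and the convention that `bits` counts magnitude bits without the sign are all assumptions made for illustration. The idea shown is only the general one: each group is shifted to its largest exponent, bits shifted past the chosen width are dropped, and the MAC then runs on integers.

```python
import math


def align_mantissas(values, bits):
    """Align a group of floats to the group's largest exponent.

    Returns (ints, scale) with values[i] ~= ints[i] * scale and
    |ints[i]| < 2**bits. `bits` counts magnitude bits only (assumption).
    """
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return [0] * len(values), 1.0
    # frexp gives v = m * 2**e with 0.5 <= |m| < 1, so |v| < 2**e_max.
    e_max = max(math.frexp(v)[1] for v in nonzero)
    scale = math.ldexp(1.0, e_max - bits)
    # Dividing by a power of two is exact; int() truncates toward zero,
    # modelling the bits that fall off the end during the alignment shift.
    return [int(v / scale) for v in values], scale


def aligned_dot(x, w, x_bits, w_bits):
    """Dot product via integer MAC on aligned mantissas (illustrative)."""
    qx, sx = align_mantissas(x, x_bits)
    qw, sw = align_mantissas(w, w_bits)
    acc = sum(a * b for a, b in zip(qx, qw))  # integer multiply-accumulate
    return acc * sx * sw


if __name__ == "__main__":
    x = [1.5, 0.375, -0.0625, 3.0]
    w = [0.5, -1.0, 2.0, 0.25]
    for bits in (4, 8):
        print(bits, align_mantissas(x, bits), aligned_dot(x, w, bits, 4))
```

In this example the exact dot product is 1.0. With 8 input bits the aligned result is also 1.0, while with 4 input bits the small-magnitude inputs lose their low-order bits and the result becomes 1.25. A narrower aligned mantissa means less integer MAC work but a larger truncation error, which is the trade-off the bitwidth prediction is meant to manage per group.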

Paper Structure

This paper contains 11 sections, 1 equation, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) FP8 parameters extracted from Llama-7b in different formats. (b) Requirement of variable-mantissa computation based on FP-DCIM.
  • Figure 2: Overall framework of our software-hardware co-designed variable-mantissa FP8 DCIM accelerator.
  • Figure 3: The schematic of the proposed MPU.
  • Figure 4: FIAU achieves alignment by controlling pointer movement.
  • Figure 5: The schematic of the adder tree and fusion unit.
  • ...and 3 more figures