BBAL: A Bidirectional Block Floating Point-Based Quantisation Accelerator for Large Language Models

Xiaomeng Han, Yuan Cheng, Jing Wang, Junyang Lu, Hui Wang, X. x. Zhang, Ning Xu, Dawei Yang, Zhe Jiang

TL;DR

This work tackles the challenge of deploying large language models on edge devices by addressing quantisation inefficiencies in Block Floating Point (BFP) formats. It introduces Bidirectional Block Floating Point (BBFP), which uses a 1-bit flag and overlap bits to reduce the likelihood of aligning data to the maximum shared exponent, thereby lowering quantisation error while preserving fixed-point efficiency. Building on BBFP, the authors design BBAL, a full-stack accelerator comprising a BBFP-based processing element array and a cost-efficient nonlinear computation unit with segmented LUTs, enabling efficient handling of nonlinear layers such as Softmax and GELU. Experimental results show BBFP-based nonlinear quantisation (BBFP(10,5)) achieves minimal accuracy loss (max 0.44 PPL increase) and BBAL delivers 22% higher accuracy than an outlier-aware baseline at similar area, plus about 40% higher throughput than a BFP-based accelerator at similar accuracy, highlighting the practical impact for edge-LLM inference.

Abstract

Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block Floating Point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating point operations into low-bit fixed point operations. However, BFP requires aligning all data to the maximum exponent, which causes loss of small and moderate values, resulting in quantisation error and degradation in the accuracy of LLMs. To address this issue, we propose a Bidirectional Block Floating Point (BBFP) data format, which reduces the probability of selecting the maximum as the shared exponent, thereby reducing quantisation error. Building on the features of BBFP, we present a full-stack Bidirectional Block Floating Point-Based Quantisation Accelerator for LLMs (BBAL), primarily comprising a processing element array based on BBFP, paired with a proposed cost-effective nonlinear computation unit. Experimental results show BBAL achieves a 22% improvement in accuracy compared to an outlier-aware accelerator at similar efficiency, and a 40% efficiency improvement over a BFP-based accelerator at similar accuracy.
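
To make the alignment problem concrete, the following is a minimal sketch of plain BFP quantisation (not BBFP, whose flag and overlap-bit encoding is not specified in this summary). It assumes a sign-magnitude mantissa with `mantissa_bits` magnitude bits and round-to-nearest; the function names `bfp_quantise` and `bfp_dequantise` and the example block are hypothetical and chosen only for illustration.

```python
import math

def bfp_quantise(block, mantissa_bits):
    """Quantise a block to one shared exponent (the block maximum) plus integer mantissas.

    Assumption: sign-magnitude mantissas with `mantissa_bits` magnitude bits.
    """
    nonzero = [abs(v) for v in block if v != 0.0]
    if not nonzero:
        return 0, [0] * len(block)
    # math.frexp(x) returns (f, e) with x = f * 2**e and 0.5 <= f < 1,
    # so floor(log2(x)) equals e - 1.
    shared_exp = max(math.frexp(v)[1] - 1 for v in nonzero)
    # Every value is expressed as an integer multiple of this step.
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    q_max = 2 ** mantissa_bits - 1
    mantissas = [max(-q_max, min(q_max, round(v / step))) for v in block]
    return shared_exp, mantissas

def bfp_dequantise(shared_exp, mantissas, mantissa_bits):
    """Map integer mantissas back to real values using the shared exponent."""
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [q * step for q in mantissas]

# One large value forces a coarse step for the whole block.
block = [6.0, 0.7, 0.1, -0.05]
exp, q = bfp_quantise(block, mantissa_bits=4)
restored = bfp_dequantise(exp, q, mantissa_bits=4)
print(exp, q, restored)
```

With this block the shared exponent is 2, so the step is 0.5: the value 0.7 is rounded to 0.5, and 0.1 and -0.05 collapse to zero. This is the loss of small and moderate values that the abstract attributes to aligning to the maximum exponent.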

Paper Structure

This paper contains 18 sections, 15 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) Distribution of activation and weight values in OPT-6.7B. (b) Linear and nonlinear runtime in the decoder stage of Llama-7B.
  • Figure 2: (a) The basic components of BBFP(4,2); (b) comparing the representational range of mantissas between BBFP and BFP; (c) the FP to BFP process; (d) the FP to BBFP process.
  • Figure 3: Comparison of the impact of different shared-exponent selections on activation quantisation error with BBFP(4,2).
  • Figure 4: The PPL and hardware overhead for BBFP with a width of 6 under varying overlap bit-widths.
  • Figure 5: (a) Multiplication operation between two BBFP blocks; (b) partial sum operation with BBFP and carry chain module.
  • ...and 4 more figures