Pushing the Limits of BFP on Narrow Precision LLM Inference

Hui Wang, Yuan Cheng, Xiaomeng Han, Zhengpeng Zhao, Dawei Yang, Zhe Jiang

TL;DR

This work tackles the bottleneck of nonlinear operations in large language model inference by introducing Dynamic Block Floating-Point (DBFP) and a co-designed framework, DB-Attn. DBFP combines a pivot-focus strategy with an adaptive grouping strategy, and the DH-LUT lookup table algorithm accelerates Softmax and matrix computations; the design is implemented as an RTL engine applicable to FPGA and ASIC. The authors report negligible accuracy loss, a 74% GPU speedup on the Softmax of LLaMA, and a 10x low-overhead performance improvement over state-of-the-art designs. The results point to a viable path for efficient, narrow-precision LLM inference and highlight the value of algorithm-hardware co-design for complex transformer workloads.

Abstract

The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed using inefficient floating-point formats, which makes it difficult to optimize the system for both software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP version that overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing; (ii) DH-LUT, a novel lookup table algorithm dedicated to accelerating nonlinear operations in the DBFP format; (iii) an RTL-level DBFP-based engine implemented to support DB-Attn, applicable to FPGA and ASIC. Results show that DB-Attn provides significant performance improvements with negligible accuracy loss, achieving a 74% GPU speedup on the Softmax of LLaMA and a 10x low-overhead performance improvement over SOTA designs.
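
For readers unfamiliar with the exponent sharing mentioned above, the following is a minimal sketch of conventional block floating point, not the paper's DBFP. It assumes one shared exponent per block, taken from the largest magnitude, and signed integer mantissas. The function names and the 4-bit mantissa width are illustrative choices, not taken from the paper.

```python
import math

def bfp_quantize(block, mantissa_bits=4):
    """Quantize a block of floats to one shared exponent plus signed integer mantissas.

    Illustrative plain BFP only; the names and bit width here are assumptions.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e becomes the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    # Spacing of the mantissa grid for this block.
    step = math.ldexp(1.0, shared_exp - mantissa_bits)
    limit = (1 << mantissa_bits) - 1
    # Round each value to the grid and clamp to the representable mantissa range.
    mantissas = [max(-limit, min(limit, round(x / step))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=4):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    step = math.ldexp(1.0, shared_exp - mantissa_bits)
    return [m * step for m in mantissas]

# Usage: the largest value sets the exponent, so the smallest value is flushed to zero.
exp, mant = bfp_quantize([1.0, 0.5, -0.25, 0.03])
print(exp, mant)                  # 1 [8, 4, -2, 0]
print(bfp_dequantize(exp, mant))  # [1.0, 0.5, -0.25, 0.0]
```

In this toy block, 0.03 is lost because it shares an exponent with 1.0. Precision loss of this kind on blocks with a wide dynamic range is one plausible reason plain BFP struggles with "diverse data", although the sketch does not reproduce the paper's own analysis or its pivot-focus and adaptive grouping strategies.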

Paper Structure

This paper contains 28 sections, 11 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Memory overhead and latency of prefill stages for LLaMA3-8B scale superlinearly with sequence length.
  • Figure 2: Number system comparison between floating-point numbers (a), BFP (b), and our DBFP (c).
  • Figure 3: Non-uniform hierarchical LUT with five intervals.
  • Figure 4: DB-Attn algorithm-driven Softmax hardware architecture and enhancement compared with FP16 design.
  • Figure 5: Pipeline's balanced proportion under input sequence length growth.
  • ...and 1 more figure