H-FA: A Hybrid Floating-Point and Logarithmic Approach to Hardware Accelerated FlashAttention

Kosmas Alexandridis, Giorgos Dimitrakopoulos

TL;DR

This work addresses the attention bottleneck in long sequences by extending FlashAttention with H-FA, a hybrid floating-point and logarithmic hardware approach. By computing attention scores in FP and performing the fused softmax and value-weighted sum in the log domain, H-FA replaces costly FP multiplications and divisions with fixed-point additions and shifts, while avoiding explicit exponential evaluations until the final step. Hardware results at 28 nm show that H-FA achieves ~26–27% area reduction and ~23% power reduction compared with FP-only FlashAttention accelerators, with negligible impact on LLM accuracy across multiple models and benchmarks. The approach leverages LogDiv for log-domain division and Mitchell’s approximation to maintain accuracy, offering a practical path to energy-efficient, scalable attention hardware for large language models.

Abstract

Transformers have significantly advanced AI and machine learning through their powerful attention mechanism. However, computing attention on long sequences can become a computational bottleneck. FlashAttention mitigates this by fusing the softmax and matrix operations into a tiled computation pattern that decouples performance from sequence length. Though designed for GPUs, its simplicity also makes it well suited for direct hardware acceleration. To improve hardware implementation, we compute FlashAttention using a mixture of floating-point and fixed-point logarithm-domain representations. Floating-point is used to compute attention scores from the query and key matrices, while logarithmic computation simplifies the fused computation of softmax normalization and the multiplication with the value matrix. This transformation, called H-FA, replaces vector-wide floating-point multiplication and division operations with additions and subtractions implemented efficiently with fixed-point arithmetic in the logarithm domain. Exponential function evaluations are effectively omitted and fused with the remaining operations, and the final result is directly returned to floating-point arithmetic without any additional hardware overhead. Hardware implementation results at 28 nm demonstrate that H-FA achieves a 26.5% reduction in area and a 23.4% reduction in power, on average, compared to parallel FlashAttention hardware architectures built solely with floating-point datapaths, without hindering performance.
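
The tiled computation described in the abstract can be illustrated with a minimal floating-point sketch for a single query vector. It is not the H-FA datapath: the function names (`fau_block`, `merge`, `divide_via_logs`) and the toy data are assumptions made for illustration, the merge step is the standard online-softmax rescaling and may differ in form from the paper's merge equation, and the log-domain division uses ordinary floating-point logarithms rather than the fixed-point arithmetic the paper describes. It only shows that per-block partial results can be merged and that the final division can be written as a subtraction of logarithms.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_attention(q, keys, values):
    """Plain two-pass attention for one query, used only as a reference."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def fau_block(q, keys, values):
    """One pass over a block of keys/values (hypothetical helper).

    Returns the running maximum m, the running sum of exponentials l,
    and the un-normalized output vector o.
    """
    m, l = -math.inf, 0.0
    o = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * scale + p
        o = [oj * scale + p * vj for oj, vj in zip(o, v)]
        m = m_new
    return m, l, o

def merge(a, b):
    """Combine two partial results by rescaling both to the common maximum."""
    (m_a, l_a, o_a), (m_b, l_b, o_b) = a, b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(o_a, o_b)]

def divide_via_logs(o, l):
    """Final normalization o / l written as 2**(log2|o| - log2 l), sign kept apart."""
    return [0.0 if x == 0.0
            else math.copysign(2.0 ** (math.log2(abs(x)) - math.log2(l)), x)
            for x in o]

# Toy data (assumed): four key/value pairs split into two blocks of two.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

part_a = fau_block(q, keys[:2], values[:2])
part_b = fau_block(q, keys[2:], values[2:])
m, l, o = merge(part_a, part_b)

out = [x / l for x in o]           # floating-point division
out_log = divide_via_logs(o, l)    # same result via subtraction of logs
ref = softmax_attention(q, keys, values)
```

Both `out` and `out_log` should agree with `ref` up to rounding, regardless of how the keys and values are split into blocks.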

Paper Structure

This paper contains 24 sections, 25 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: The organization of the FlashAttention Unit (FAU) serving one query vector. Multiple FAUs can serve multiple query vectors in parallel.
  • Figure 2: Block-level computation of attention for a single query vector. The set of key and value vectors is split into multiple blocks, which are computed in parallel by the FlashAttention units (FAUs). The partial attention outputs are accumulated in the vertical direction using cascaded ACC units that implement Eq. \ref{e:merge}. The final normalized attention value is computed using division.
  • Figure 3: The optimized structure of the FAU operating partly in the logarithm domain.
  • Figure 4: Hardware architecture of optimized ACC blocks for accumulating partial attention results.
  • Figure 5: The distribution of the absolute values of the inputs undergoing Mitchell’s approximation, along with the corresponding absolute error introduced in each case. By definition, these inputs fall within the interval [0, 1].
  • ...and 3 more figures
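
Mitchell’s approximation, referenced in the Figure 5 caption, can be sketched in its textbook form: for a positive value written as 2^k (1 + f) with f in [0, 1), the base-2 logarithm is approximated by k + f, and the inverse maps k + f back to 2^k (1 + f). The snippet below is a floating-point illustration under that assumption. The function names are hypothetical, and the paper's fixed-point realization and the exact place where the approximation is applied in H-FA are not reproduced here.

```python
import math

def mitchell_log2(v: float) -> float:
    """Approximate log2(v) as k + f, where v = 2**k * (1 + f) and 0 <= f < 1."""
    if v <= 0.0:
        raise ValueError("v must be positive")
    mantissa, exponent = math.frexp(v)  # v = mantissa * 2**exponent, mantissa in [0.5, 1)
    k = exponent - 1                    # rewrite as v = (2 * mantissa) * 2**k
    f = 2.0 * mantissa - 1.0            # fractional part in [0, 1)
    return k + f

def mitchell_exp2(x: float) -> float:
    """Approximate 2**x as 2**k * (1 + f), where k = floor(x) and f = x - k."""
    k = math.floor(x)
    f = x - k
    return math.ldexp(1.0 + f, k)

# Largest absolute error of log2(1 + f) ~= f over a grid of f values in [0, 1).
worst = max(abs(math.log2(1.0 + i / 1024) - mitchell_log2(1.0 + i / 1024))
            for i in range(1024))
```

The error of the logarithm approximation vanishes at exact powers of two and peaks near f ≈ 0.44, where it is roughly 0.086. The two functions are exact inverses of each other in this idealized form, so a value converted to the log domain and back is recovered unchanged when no other operation intervenes.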