FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill

Rakshith Jayanth, Viktor Prasanna

TL;DR

FAST-Prefill is proposed as the first FPGA accelerator for long-context prefill-stage inference with dynamic sparse attention. It demonstrates an average speedup of up to 2.5× in time to first token (TTFT) and a 4.5× improvement in energy efficiency over a GPU implementation on an Nvidia A5000.

Abstract

In long-context large language model (LLM) inference, the prefill stage dominates computation due to self-attention over the complete input context. Sparse attention significantly reduces self-attention computation by limiting each token's interactions to a subset of tokens. The attention sparsity pattern varies across input prompts, and within a prompt, each attention head can follow a distinct pattern. This makes attention sparsity dynamic. The requirement of generating the sparsity pattern, combined with limited data reuse in attention, shifts the prefill compute to being memory-bound. This, in addition to the huge energy requirements for long-context inference on GPU, motivates FPGAs as good candidates for accelerating dynamic long-context inference. To tackle these challenges, we propose FAST-Prefill, the first FPGA accelerator for long-context prefill-stage inference with dynamic sparse attention. To efficiently generate sparse indices, we propose a fused pipeline unit with a memory-aware execution order to reduce large tensors and irregular memory accesses. To reduce off-chip memory traffic for accessing the KV cache, we utilize the memory hierarchy to design a liveness-driven, dual-tier cache. For high-throughput matrix multiplication, we design a hybrid Matrix Processing Unit (MPU) with DSPs and bit-plane decomposition using LUTs. We implement FAST-Prefill on Alveo U280 and evaluate it on the Llama and Qwen models (batch size = 1) for context lengths ranging from 4K to 128K tokens. We demonstrate an average speedup of up to 2.5× in TTFT and 4.5× improvement in energy efficiency over GPU implementation on Nvidia A5000 GPU.
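The abstract names bit-plane decomposition as the LUT-side technique in the hybrid MPU but does not describe the scheme. As a generic illustration only (not the authors' design), the sketch below shows the underlying identity: an unsigned integer weight matrix W can be written as a sum of binary matrices W_k weighted by 2^k, so A·W becomes a series of select-and-add passes followed by shifts, with no multipliers. The function names, the unsigned 4-bit weight format, and the example matrices are all assumptions made for this example.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned integers into `bits` binary matrices, LSB first."""
    return [[[(v >> k) & 1 for v in row] for row in matrix] for k in range(bits)]


def binary_matmul(a, plane):
    """Multiply integer matrix `a` by a 0/1 matrix using only selection and addition."""
    rows, inner, cols = len(a), len(plane), len(plane[0])
    return [
        [sum(a[i][k] for k in range(inner) if plane[k][j]) for j in range(cols)]
        for i in range(rows)
    ]


def bitplane_matmul(a, w, bits):
    """Compute a @ w as sum over k of (a @ w_k) << k, where w_k is the k-th bit-plane of w."""
    rows, cols = len(a), len(w[0])
    acc = [[0] * cols for _ in range(rows)]
    for k, plane in enumerate(bit_planes(w, bits)):
        partial = binary_matmul(a, plane)
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += partial[i][j] << k  # weight the plane by 2**k
    return acc


def reference_matmul(a, w):
    """Ordinary multiply-accumulate matmul, used only to check the result."""
    return [
        [sum(a[i][k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for i in range(len(a))
    ]


# Hypothetical data: signed activations, unsigned 4-bit weights.
a = [[1, -2, 3], [4, 5, -6]]
w = [[7, 0], [3, 15], [9, 1]]
assert bitplane_matmul(a, w, bits=4) == reference_matmul(a, w)
```

Each binary pass needs only conditional additions, which is the kind of logic that fits LUT fabric; how FAST-Prefill actually splits work between DSPs and LUTs is not stated in the abstract.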

Paper Structure

This paper contains 24 sections, 8 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: FAST-Prefill Architecture
  • Figure 2: Prefill Workflow with Sparse Attention
  • Figure 3: Sparse Index Generation Unit Workflow
  • Figure 4: Sparse Attention Unit Workflow
  • Figure 5: Comparing TTFT of FAST-Prefill with the Baseline GPU implementation
  • ...and 3 more figures