A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA

Neelesh Gupta; Peter Wang; Rajgopal Kannan; Viktor K. Prasanna

TL;DR

We present an FPGA accelerator for batch-1 Gated DeltaNet (GDN) decode that holds the full 2 MB recurrent state persistently in on-chip BRAM, removing the per-token round trip through off-chip memory and converting the workload from memory-bound to compute-bound.

Abstract

Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs such as Qwen3-Next use 75% GDN layers and achieve accuracy competitive with attention-only models. However, at batch-1, GDN decode is memory-bound on GPUs because the full recurrent state must be round-tripped through HBM for every token. We show that this bottleneck is architectural, not algorithmic: all subquadratic sequence models exhibit arithmetic intensities below 1 FLOP/B at decode time, making them more memory-bound than standard Transformers. We present an FPGA accelerator that eliminates this bottleneck by holding the full 2 MB recurrent state persistently in on-chip BRAM, converting the workload from memory-bound to compute-bound. Our design fuses the GDN recurrence into a five-phase pipelined datapath that performs only one read pass and one write pass over each state matrix per token, exploits Grouped Value Attention (GVA) for paired-head parallelism, and overlaps preparation, computation, and output storage via dataflow pipelining. We explore four design points on an AMD Alveo U55C using Vitis HLS, varying head-level parallelism from 2 to 16 value-heads per iteration. Our fastest configuration achieves 63 $\mu$s per token, 4.5$\times$ faster than the GPU reference on an NVIDIA H100 PCIe. Post-implementation power analysis reports 9.96 W on-chip, yielding up to 60$\times$ greater energy efficiency per token decoded.
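To make the "one read and one write pass" claim concrete, the following is a minimal pure-Python sketch of a single-head gated delta-rule decode step, assuming the standard Gated DeltaNet recurrence $S_t = \alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ with output $o_t = S_t q_t$. It is not the authors' five-phase datapath, whose phases are not described on this page. The function name `gdn_decode_step`, the row-major state layout, and the use of a scalar decay `alpha` (standing in for whatever the gate $g_t$ maps to) are illustrative assumptions; q/k normalization and output scaling are omitted.

```python
def gdn_decode_step(S, q, k, v, alpha, beta):
    """One gated delta-rule decode step for a single head (illustrative sketch).

    S     : list of d_v rows, each a list of d_k floats (updated in place)
    q, k  : lists of d_k floats
    v     : list of d_v floats
    alpha : scalar decay gate in (0, 1]  (assumed parameterization)
    beta  : scalar write strength in [0, 1]
    Returns the output vector o = S_new @ q.
    """
    out = []
    for i, row in enumerate(S):
        # Read pass: load the row once, apply the decay, and form the
        # prediction (alpha * S[i]) . k needed by the delta rule.
        decayed = [alpha * s for s in row]
        pred = sum(s * kj for s, kj in zip(decayed, k))
        # Delta-rule correction for this value dimension.
        u = beta * (v[i] - pred)
        # Write pass: rank-1 update of the buffered row, with the output
        # dot product computed on the same data before it is stored.
        new_row = [s + u * kj for s, kj in zip(decayed, k)]
        out.append(sum(s * qj for s, qj in zip(new_row, q)))
        S[i] = new_row
    return out


# Tiny example: with beta = 1 and a unit-norm key, the updated state
# maps k exactly to v, so querying with q = k returns v.
S = [[1.0, 0.0], [0.0, 1.0]]
o = gdn_decode_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                    alpha=0.5, beta=1.0)
print(o)  # [2.0, 3.0]
```

Each state row is independent of the others, so every row is read once and written once per token. On a GPU both passes go through HBM; the design described in the abstract instead keeps them in on-chip BRAM.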

Paper Structure (26 sections, 8 equations, 6 figures, 6 tables, 2 algorithms)

Introduction
Background
Qwen3-Next and Hybrid LLM Architectures
Prefill vs. Decode
Gated DeltaNet Recurrence
Performance Model
Problem Formulation
Baseline Cost Analysis
Latency Decomposition
Architecture Design
Persistent On-Chip State
Fused Five-Phase Compute Pipeline
GVA-Aware Paired-Head Parallelism
Dataflow Pipelining
System Overview
...and 11 more sections

Figures (6)

Figure 1: Batch-1 decode arithmetic intensity on the H100 FP32 roofline. All architectures fall deep in the memory-bound regime. MHSA Transformers [vaswani2017attention] achieve an arithmetic intensity near 1 FLOP/B, whereas subquadratic models (GDN [yang2025gated], DeltaNet [yang2024deltanet], Mamba [gu2024mamba], Mamba-2 [dao2024mamba2]) fall well below 1 FLOP/B.
Figure 2: Qwen3-Next [qwen2025qwen3next] hybrid architecture. GDN layers (3:1 ratio) replace softmax attention with a fixed-size recurrent state, making GDN decode the dominant per-token primitive.
Figure 3: System architecture for $H_{\mathrm{iter}}{=}8$. Per-token inputs arrive via three AXI ports (left) and are buffered in on-chip BRAM. The prepare stage computes gates $g_t$, $\beta_t$. Four GVA pairs each share a q/k datapath across two PEs; each PE executes the fused five-phase pipeline with $P_K{=}16$ column parallelism (MAC$\times$16, II$=$1). The 128 dual-port BRAM arrays (bottom) hold the persistent 2 MB state, partitioned by head and column bank, with all four iteration slices sharing the same physical banks at different addresses.
Figure 4: Per-token latency comparison for batch-1 GDN decode. All FPGA configurations outperform the GPU baseline. $H_{\mathrm{iter}}{=}8$ is optimal; $H_{\mathrm{iter}}{=}16$ regresses due to pipeline interval inflation.
Figure 5: FPGA resource utilization and latency across configurations. BRAM plateaus at 25%; DSP, FF, and LUT scale with $H_{\mathrm{iter}}$. Latency (right axis) reaches a minimum at $H_{\mathrm{iter}}{=}8$; $H_{\mathrm{iter}}{=}16$ increases latency due to pipeline interval inflation despite fewer iterations.
...and 1 more figures
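The numbers in the Figure 3 caption can be cross-checked with a few lines of arithmetic. The sketch below assumes FP32 state elements, square per-head state matrices, and 2 MB meaning 2 MiB; none of these is stated on this page. The derived head count and head dimension are therefore inferences, not figures quoted from the paper.

```python
from math import isqrt

# Values taken from the Figure 3 caption.
H_ITER = 8           # value-heads processed per iteration
ITER_SLICES = 4      # iteration slices sharing the same physical banks
P_K = 16             # column parallelism per PE
STATE_BYTES = 2 * 1024 * 1024  # "2 MB" state, assumed to mean 2 MiB

# Assumption: state is stored as 32-bit floats.
BYTES_PER_ELEM = 4

gva_pairs = H_ITER // 2               # two PEs share one q/k datapath
value_heads = H_ITER * ITER_SLICES    # heads covered across all slices
bram_arrays = H_ITER * P_K            # one bank per (PE, column lane)

elems_per_head = STATE_BYTES // BYTES_PER_ELEM // value_heads
head_dim = isqrt(elems_per_head)      # assumes a square d x d state

print(gva_pairs, value_heads, bram_arrays, head_dim)
```

Under these assumptions the caption is self-consistent: 8 PEs with 16 column lanes each give the 128 BRAM arrays, 8 heads over four slices give 32 value-heads, and 2 MiB spread over 32 FP32 heads corresponds to a 128 by 128 state per head.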
