FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, Binhang Yuan

TL;DR

This work tackles the high $O(N^2)$ cost of long-context attention in large language models by improving the kernel implementation of sparse attention with Flash Sparse Attention (FSA). FSA inverts NSA's kernel loop order so that KV blocks are iterated in the outer loop and query tokens in the inner loop, which removes padding and reduces memory traffic; it relies on index-based token selection, an online softmax pass, and a two-stage reduction. The authors provide a theoretical memory/FLOP analysis and extensive experiments on NVIDIA H20/H200 GPUs, reporting kernel-level speedups of up to $3.5\times$, end-to-end training speedups of up to $1.25\times$, and inference prefill speedups of up to $1.36\times$ over NSA, with larger gains relative to full attention. The improvements hold across varied GQA group sizes and sequence lengths, underscoring the importance of algorithm–system co-design for practical hardware-efficient sparse attention.
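
The loop inversion described above can be illustrated with a small pure-Python sketch. It is not the authors' Triton kernel: the names `query_outer`, `kv_outer`, `sel`, and `buf` are hypothetical, a single head is used, and causal masking, GQA grouping, and NSA's other attention branches are omitted. The merge of per-block partials by their running maximum and normalizer is the standard online-softmax identity; how FSA actually splits the softmax pass and the reduction across kernels may differ.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_outer(Q, K, V, sel, block):
    """Query-major order: each query walks its selected KV blocks,
    keeping a running max m, normalizer l, and weighted sum acc."""
    out = []
    for q, blocks in zip(Q, sel):
        m, l, acc = -math.inf, 0.0, [0.0] * len(V[0])
        for b in blocks:
            for j in range(b * block, (b + 1) * block):
                s = dot(q, K[j])
                m_new = max(m, s)
                scale = math.exp(m - m_new)  # rescale the old state
                p = math.exp(s - m_new)
                l = l * scale + p
                acc = [a * scale + p * v for a, v in zip(acc, V[j])]
                m = m_new
        out.append([a / l for a in acc])
    return out

def kv_outer(Q, K, V, sel, block):
    """KV-major order: each KV block gathers the queries that selected it
    and writes one partial result per (query, block) into a buffer."""
    dim = len(V[0])
    buf = [[] for _ in Q]  # hypothetical stand-in for an output buffer
    for b in range(len(K) // block):
        lo = b * block
        rows = [i for i, blocks in enumerate(sel) if b in blocks]
        for i in rows:
            scores = [dot(Q[i], K[lo + t]) for t in range(block)]
            m = max(scores)
            p = [math.exp(s - m) for s in scores]
            o = [sum(p[t] * V[lo + t][d] for t in range(block)) for d in range(dim)]
            buf[i].append((m, sum(p), o))
    out = []
    for parts in buf:  # second stage: merge the partials of each query
        m_all = max(m for m, _, _ in parts)
        w = [math.exp(m - m_all) for m, _, _ in parts]
        l_all = sum(wi * l for wi, (_, l, _) in zip(w, parts))
        out.append([sum(wi * o[d] for wi, (_, _, o) in zip(w, parts)) / l_all
                    for d in range(dim)])
    return out

random.seed(0)
n_q, n_kv, dim, block, top_k = 6, 16, 4, 4, 2
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_q)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]
V = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_kv)]
# Each query selects top_k distinct KV blocks (random here, for illustration).
sel = [random.sample(range(n_kv // block), top_k) for _ in range(n_q)]

a = query_outer(Q, K, V, sel, block)
b = kv_outer(Q, K, V, sel, block)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)  # the two loop orders agree up to floating-point rounding
```

Both orders compute the same attention output; the difference is which dimension is batched inside one unit of work. In the KV-major order, the inner batch is the set of query tokens that selected a block rather than the query heads of one GQA group.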

Abstract

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group. This inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with varied, smaller numbers of query heads in each GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. The code is available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
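
To see why a small GQA group can hurt a kernel that batches query heads, consider the following back-of-the-envelope sketch. It assumes that a matrix-multiply tile needs at least 16 rows and that a smaller group is padded up to that size; the value 16 and the helper names `padded_rows` and `useful_fraction` are assumptions for illustration, not figures taken from this page.

```python
import math

def padded_rows(heads_per_group: int, min_tile_rows: int = 16) -> int:
    """Rows actually processed when the group is padded up to a tile multiple."""
    return math.ceil(heads_per_group / min_tile_rows) * min_tile_rows

def useful_fraction(heads_per_group: int, min_tile_rows: int = 16) -> float:
    """Share of the padded tile that holds real query heads."""
    return heads_per_group / padded_rows(heads_per_group, min_tile_rows)

for g in (1, 2, 4, 8, 16):
    print(f"heads/group={g:2d}  padded rows={padded_rows(g):2d}  "
          f"useful={useful_fraction(g):.2%}")
```

Under this assumption, a group of 4 query heads fills only a quarter of the tile, while a group of 16 fills it completely.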

Paper Structure

This paper contains 18 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Left: Illustration of the NSA kernel (Yuan et al., 2025), which iterates over query tokens in the outer loop and processes KV blocks in the inner loop. Right: Illustration of the FSA kernel, which instead iterates over KV blocks in the outer loop and processes query tokens in the inner loop; partial attention results are stored in the output buffer $\mathbf{O}_{\text{buf}}$ for accumulation (see §\ref{sec:fsa-impl} for more details).
  • Figure 2: Comparison of memory access volume and FLOPs, with block size 64 and top-k 16. FSA's memory volume and FLOPs are normalized to 1.
  • Figure 3: Real-time profiling results of FSA and NSA kernel execution overhead across different GPUs, with block size $B_K=64$ and top-k value $T=16$. FSA latency is normalized to 1.
  • Figure 4: Performance comparison of Triton-based FSA, NSA, and full attention (enabled by Flash Attention) kernels, with block size and top-k value ($B_K$, $T$) set to ($64$, $16$) and ($128$, $8$).
  • Figure 5: End-to-end training latency of FSA, NSA, and full attention.
  • ...and 6 more figures