Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi

TL;DR

The paper tackles the quadratic complexity of standard dot-product attention in vision models by proposing Generalized Neighborhood Attention (GNA), which adds a stride parameter that unifies sliding-window neighborhood attention (NA), strided sliding-window attention, and blocked attention. It contributes NATTEN Sim, a simulator that estimates realistic upper-bound speedups across tiling strategies and hardware constraints, and a GNA implementation built on a CUTLASS FMHA kernel for the NVIDIA Blackwell architecture. In many perfectly block-sparse cases the kernel matches the speedups predicted by the simulator, and it delivers 28% to 46% end-to-end speedup on Cosmos-7B, HunyuanVideo, and FLUX on B200 without fine-tuning. The authors plan to open-source the simulator and kernels through the NATTEN project. Together, the flexible sparse-pattern family, the analytical speedup model, and the architecture-aware implementation offer a practical path to fast, locality-focused sparse attention in large-scale vision models.

Abstract

Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
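
As an illustration of how a single stride parameter can cover sliding window, strided sliding window, and blocked attention, here is a minimal 1-D sketch in Python with NumPy. It is not the NATTEN API or the paper's kernel: the function name `gna_mask_1d` is hypothetical, and using the center-most query of each group as the reference point for the shared window is an assumption made for illustration.

```python
import numpy as np


def gna_mask_1d(n: int, window: int, stride: int = 1) -> np.ndarray:
    """Return a boolean (n, n) mask; row i marks the keys that query i attends to.

    Illustrative only. Queries are grouped into runs of `stride` tokens and
    every query in a group shares one context window.
    """
    if not (1 <= stride <= window <= n):
        raise ValueError("expected 1 <= stride <= window <= n")
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        group_start = (q // stride) * stride
        # Assumption: the center-most query of the group anchors the window.
        leader = min(group_start + stride // 2, n - 1)
        # Window around the leader, shifted inward at the borders so that
        # every query still attends to exactly `window` keys.
        start = min(max(leader - window // 2, 0), n - window)
        mask[q, start:start + window] = True
    return mask


if __name__ == "__main__":
    n, window = 12, 4
    for stride in (1, 2, 4):
        m = gna_mask_1d(n, window, stride)
        print(f"stride={stride}")
        print(m.astype(int))
```

Under these assumptions, `stride=1` gives every query its own window (standard neighborhood attention), `stride=window` produces a block-diagonal mask (blocked attention) when `n` is a multiple of `window`, and values in between make adjacent queries share a window, which is what makes the resulting matrix multiplications denser.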

Paper Structure

This paper contains 23 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Generalized Neighborhood Attention adds a new "stride" parameter to neighborhood attention, which introduces a delay in the sliding window by grouping queries together and forcing them to share their context window. This increases density in matrix multiplications, while maintaining overall sparsity, leading to speedups more proportional to savings in FLOPs. Our Blackwell kernel can as a result realize the maximum speedup theoretically possible in some cases, with respect to both FLOPs and simulation.
  • Figure 2: Generalized Neighborhood Attention allows customizable "delay steps" in the sliding window pattern through the new stride parameter. Stride can take any positive integer value smaller than or equal to window size. Stride of 1 is equivalent to standard neighborhood attention. When stride is equal to window size, it is equivalent to blocked attention (a.k.a. Window Self Attention).
  • Figure 3: Curse of multi-dimensionality: single-dimensional tiling opens up sparsity in multi-dimensional layouts of tokens to more wasted computation (FLOPs that are still computed but masked prior to softmax). While many fine-grained attention masks, even 1-D sliding window attention (top), can still have some FLOPs masked due to the fact that the vector-matrix multiplies are packed into matrix-matrix multiplies, masked FLOPs due to multi-dimensionality can be much more significant (bottom). Note that the single-dimensional case is bi-directional and not causal for better comparison to the multi-dimensional case.
  • Figure 4: Sweep of different stride values and their analytical speedup according to NATTEN Sim, with a window size of 18 × 24 × 24 ($\approx 91\%$ sparsity). The simulation assumes a Q tile shape of 4 × 8 × 8, and KV tile shape of 2 × 8 × 8, a combination supported by our Blackwell kernel. Standard neighborhood attention (stride 1) is limited by a 3.3× speedup, while some larger strides can improve upon that, and eventually cross 9× speedup with a stride of 8 × 8 across the spatial axes. With some larger strides along the temporal axis, it can reach perfect block-sparsity, and yield a speedup of 11.1×, which is equivalent to its FLOP-wise speedup.
  • Figure 5: Operation-level (attention only) speedups on Cosmos, FLUX, and HunyuanVideo with $\approx 90\%$ sparsity through GNA. Analytical speedup is according to NATTEN Sim, and actual speedups are measured by running on B200. Note that with the perfectly block-sparse strides, our kernel can come very close to or fully match the analytical speedup, but is limited by the naive implementation of the memory operation (token permute).
  • ...and 3 more figures
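
The 11.1× figure quoted in the Figure 4 caption is consistent with the sparsity level alone: if a fraction s of the attention FLOPs is skipped, the FLOP-wise upper bound on speedup is 1 / (1 - s). The short Python check below works through that arithmetic, along with the token counts implied by the window and tile shapes in the caption. The helper name `flopwise_speedup` is hypothetical, and 0.91 is the rounded sparsity from the caption, so the result is approximate.

```python
from math import prod


def flopwise_speedup(sparsity: float) -> float:
    """Upper bound on speedup if only the unmasked fraction of FLOPs is computed."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


# Shapes taken from the Figure 4 caption.
window = (18, 24, 24)
q_tile = (4, 8, 8)
kv_tile = (2, 8, 8)

print("keys per query window:", prod(window))   # 10368
print("tokens per Q tile:", prod(q_tile))       # 256
print("tokens per KV tile:", prod(kv_tile))     # 128
print("FLOP-wise bound at ~91% sparsity:", round(flopwise_speedup(0.91), 1))  # 11.1
```

Standard neighborhood attention (stride 1) reaches only 3.3× in the same simulation, which is the gap between FLOP savings and tile-level work that the stride parameter is meant to close.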