Trainable Dynamic Mask Sparse Attention

Jingze Shi; Yifan Wu; Yiran Peng; Bingheng Wu; Liangdong Wang; Guang Liu; Yuyu Luo

TL;DR

This work introduces Dynamic Mask Attention (DMA), a trainable dual-aware sparse attention mechanism that combines content-aware dynamic masking with position-aware sparse weights to address the quadratic bottleneck of self-attention in long-context modeling. A dedicated CUDA kernel fuses FlashAttention-like tiling with hardware-efficient skip logic, reducing time from $O(n^2)$ to $O(n \cdot w)$ and memory to $O(n \cdot w)$ for window size $w \ll n$, while preserving end-to-end differentiability. Extensive experiments across scaling laws, multi-query associative recall, downstream benchmarks, and extrapolated retrieval demonstrate that DMA offers a consistent Pareto advantage over state-of-the-art sparse baselines, with up to 10x speedups. The work also provides open-source kernel code to facilitate adoption and further research in efficient long-context transformers.

Abstract

The growing demand for long-context modeling in large language models (LLMs) is bottlenecked by the quadratic complexity of standard self-attention. Sparse attention has been proposed to mitigate this issue. However, position-aware sparse attention methods rely on static sparse structures that cannot adapt to diverse query contexts, while content-aware sparse attention methods depend on heuristic key-value selection, which hinders full differentiability. We introduce Dynamic Mask Attention (DMA), a trainable dynamic mask sparse attention mechanism that combines the advantages of both position-aware and content-aware approaches. DMA achieves this through three key innovations. First, it uses value vector representations to generate content-aware dynamic masks, enabling the model to adaptively identify and attend to critical information. Second, it computes position-aware sparse weights in a hardware-friendly manner, efficiently skipping unnecessary computational regions. Third, we demonstrate that the introduced dynamic mask and sparse weights do not obstruct gradients, supporting end-to-end training. Comprehensive experiments show that DMA consistently holds a Pareto advantage over state-of-the-art sparse attention baselines on scaling laws, multi-query associative recall, standard benchmarks, and needle-in-a-haystack tests, while also delivering up to a 10x overall speedup. These results highlight its ability to balance model efficiency with long-context modeling capability. Our kernel code is open-source at https://github.com/SmallDoges/flash-dmattn to encourage further research and application by the community.
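
The general idea of a value-derived mask that limits each query to a small set of keys can be illustrated with a toy sketch. This is not the authors' kernel or their exact formulation: the function name `dynamic_mask_attention`, the `gate` vector that turns each value vector into a scalar importance score, and the plain top-`w` selection rule are all assumptions made for illustration. The sketch only shows that query-key dot products are computed for at most `w` keys per query. It does not reproduce the fused CUDA tiling, the position-aware sparse weights, or the gradient behavior described in the abstract.

```python
import math

def dynamic_mask_attention(q, k, v, gate, w):
    """Toy causal sparse attention: each query attends to at most w keys.

    q, k, v: lists of n vectors of dimension d (lists of floats).
    gate:    hypothetical length-d vector (an assumption, not from the paper)
             mapping each value vector to a scalar importance score.
    w:       number of keys kept per query.
    Returns (outputs, kept), where kept[i] lists the key indices used by query i.
    """
    n, d = len(q), len(q[0])
    # Content-aware score per position, derived from the value vectors only.
    importance = [sum(g * x for g, x in zip(gate, row)) for row in v]
    outputs, kept = [], []
    for i in range(n):
        # Causal candidates are positions 0..i; keep the w most important.
        keep = sorted(range(i + 1), key=lambda j: importance[j], reverse=True)[:w]
        # Scaled dot-product scores are computed only for the kept keys.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keep]
        # Numerically stable softmax over the kept keys.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Weighted sum of the kept value vectors.
        outputs.append([sum(p * v[j][c] for p, j in zip(probs, keep)) for c in range(d)])
        kept.append(sorted(keep))
    return outputs, kept
```

Calling the function with `n` positions and a small `w` returns one output vector per query together with the indices it attended to, which makes it easy to inspect which keys the value-derived scores selected. The per-query sort is used here for clarity only and is not how a hardware-efficient kernel would perform selection.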
