SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Yizhao Gao; Shuming Guo; Shijie Cao; Yuqing Xia; Yu Cheng; Lei Wang; Lingxiao Ma; Yutao Sun; Tianzhu Ye; Li Dong; Hayden Kwok-Hay So; Yu Hua; Ting Cao; Fan Yang; Mao Yang

TL;DR

Long-context reasoning in autoregressive models is hampered by attention cost that grows with the length of the KV cache at every decoding step. SeerAttention-R introduces a post-training, plug-in AttnGate that enables sparse decoding. It learns sparsity shared within Grouped Query Attention groups, removes Q pooling to suit autoregressive decoding, and relies on a K-compression cache plus a fast block-sparse decoding kernel written in TileLang. The approach achieves near-lossless reasoning accuracy with a 4K token budget on challenging benchmarks and delivers kernel speedups of up to 9x over FlashAttention-3 at high sparsity. Training is lightweight (only the gate is trained) and can be applied to multiple pretrained models, which makes deployment for long-sequence reasoning tasks practical. The work also provides thorough ablations and points to end-to-end speedups and adaptive sparsity as future work.
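To make the token budget concrete, here is a minimal sketch of how a budget might translate into a set of KV blocks during decoding. It assumes the gate emits one score per key block and that the highest-scoring blocks are kept until the budget is used up. The function names (`select_blocks`, `sparsity`) and the example scores are hypothetical and are not taken from the SeerAttention-R code.

```python
def select_blocks(gate_scores, block_size, token_budget):
    """Return sorted indices of the KV blocks kept under a token budget.

    Assumption: one gate score per key block, top-k selection by score.
    """
    # Number of blocks that fit in the budget (keep at least one block).
    num_keep = max(1, token_budget // block_size)
    num_keep = min(num_keep, len(gate_scores))
    # Rank block indices by descending gate score.
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def sparsity(num_blocks, num_kept):
    """Fraction of KV blocks skipped by the sparse kernel."""
    return 1.0 - num_kept / num_blocks


# Toy example: 4 blocks of 64 tokens, budget of 128 tokens keeps 2 blocks.
kept = select_blocks([0.1, 0.9, 0.3, 0.8], block_size=64, token_budget=128)
print(kept)  # [1, 3]

# A 32768-token context with block size 64 has 512 blocks;
# a 4096-token budget keeps 64 of them.
print(sparsity(512, 4096 // 64))  # 0.875
```

Under these assumptions a fixed budget yields higher sparsity as the context grows, which is why the budget matters most for long decoding runs.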

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying their original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
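As a rough check on the "near-theoretical" wording, the sketch below computes an idealized upper bound on kernel speedup. It assumes decoding attention time is proportional to the number of KV blocks read, so skipping a fraction s of blocks gives a bound of 1 / (1 - s). This cost model and the function name `ideal_speedup` are illustrative assumptions, not the paper's own analysis.

```python
def ideal_speedup(sparsity):
    """Upper bound on speedup if time scales with the KV blocks read.

    Assumption: a purely proportional cost model with no fixed overhead.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


# At 90% sparsity the bound is about 10x, so a measured 9x
# would be roughly 90% of this idealized limit.
bound = ideal_speedup(0.9)
print(round(bound, 2))      # 10.0
print(round(9 / bound, 2))  # 0.9
```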
