A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention

Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang, Yifan Lu, Xunliang Cai

TL;DR

This work examines the promises and challenges of native Top-$k$ sparse attention for long-context LLMs. It demonstrates that exact Top-$k$ decoding can match or exceed full attention performance at low sparsity, and that training with Top-$k$ attention further boosts results. The study also analyzes the impact of approximate retrieval on performance, showing a positive correlation with retrieval fidelity, and provides an entropy-based theoretical lens to explain why low-entropy states benefit Top-$k$ decoding. Collectively, these findings guide the design of scalable, sparse attention mechanisms for long-context reasoning in LLMs.

Abstract

Large Language Models (LLMs) are increasingly prevalent in long-context modeling; however, their inference cost has become a critical bottleneck hindering progress on tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of Top-$k$ Attention during both decoding and training. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experiments. Retaining only the pivotal Keys with the highest similarity to the Query as the context window during decoding achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we explore a native Top-$k$ Attention training strategy. Experiments confirm that keeping Top-$k$ Attention operations consistent between training and inference further unlocks the potential of Top-$k$ Decoding and significantly improves model performance. Furthermore, given the high computational complexity of exact Top-$k$ Attention, we investigate how the precision of approximate Top-$k$ algorithms affects downstream tasks. Our results confirm a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of entropy. Experimental observations indicate that models fine-tuned with Top-$k$ Attention SFT exhibit a distinct reduction in entropy on downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
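
To make the idea of exact Top-$k$ Decoding concrete, the following is a minimal single-head, single-query sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `full_attention` and `topk_attention`, the scaled dot-product similarity, and the renormalization of the softmax over the selected Keys are all assumptions made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def full_attention(q, K, V):
    # q: (d,), K: (n, d), V: (n, d_v). Standard scaled dot-product attention.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def topk_attention(q, K, V, k):
    # Assumed form of exact Top-k decoding: score every Key, keep the k
    # highest-scoring ones, and apply softmax over that subset only.
    scores = K @ q / np.sqrt(q.shape[-1])
    k = min(k, scores.shape[0])          # assumes k >= 1
    idx = np.argpartition(scores, -k)[-k:]  # indices of the k largest scores
    weights = softmax(scores[idx])
    return weights @ V[idx], idx

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((64, 16))
V = rng.standard_normal((64, 8))

out_full = full_attention(q, K, V)
out_topk, kept = topk_attention(q, K, V, k=8)
print("kept keys:", np.sort(kept))
print("L2 gap to full attention:", np.linalg.norm(out_full - out_topk))
```

In this sketch, setting `k` equal to the number of Keys recovers full attention exactly, and the gap for smaller `k` depends on how concentrated the attention weights are. Note that exact selection still scores every Key, which is the cost that approximate retrieval is meant to reduce.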

Paper Structure

This paper contains 5 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Performance of Top-$k$ Decoding on different benchmarks.
  • Figure 2:
  • Figure 3: Top-$k$ Decoding precision-performance curve on the 128K variant of the HELMET benchmark for Llama-3-8B-ProLong-512k-Instruct with a context window size of 2048.
  • Figure 4: Visualization of layer-wise statistics for four representative datasets. Top-left: Dataset A; Top-right: Dataset B; Bottom-left: Dataset C; Bottom-right: Dataset D. These plots demonstrate the variance in feature distributions across different layers.
  • Figure 5: Attention entropy reduction (%) of Llama-3-8B-ProLong-Instruct-512K-TopK-SFT compared to Llama-3-8B-ProLong-512k-Instruct on the 8K variant of the HELMET benchmark.
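
The entropy-reduction percentage reported in Figure 5 can be illustrated with a small sketch. This page does not state how attention entropy is computed, so the code below assumes Shannon entropy of a single attention distribution and a relative reduction of the form $100 \cdot (H_{\text{base}} - H_{\text{SFT}}) / H_{\text{base}}$; the helper names are hypothetical. The `topk_mass` helper shows the intuition behind the low-entropy hypothesis: the more peaked a distribution is, the more of its probability mass the top $k$ entries capture.

```python
import numpy as np

def attention_entropy(p, eps=1e-12):
    # Shannon entropy (in nats) of one attention distribution p (sums to 1).
    # Assumption: the source's exact entropy definition is not given here.
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))

def entropy_reduction_percent(h_base, h_sft):
    # Assumed form of the relative reduction shown as a percentage.
    return 100.0 * (h_base - h_sft) / h_base

def topk_mass(p, k):
    # Fraction of attention mass held by the k largest weights.
    p = np.asarray(p, dtype=np.float64)
    return float(np.sort(p)[-k:].sum())

uniform = np.full(8, 1.0 / 8.0)                 # high-entropy distribution
peaked = np.array([0.9, 0.04, 0.02, 0.01,       # low-entropy distribution
                   0.01, 0.01, 0.005, 0.005])

for name, p in [("uniform", uniform), ("peaked", peaked)]:
    print(name, "entropy:", attention_entropy(p), "top-2 mass:", topk_mass(p, 2))
```

Here the uniform distribution keeps only a quarter of its mass in the top 2 entries, while the peaked one keeps most of it, so truncating to the top $k$ Keys discards less information in the low-entropy case.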

Theorems & Definitions (2)

  • Definition 1: Top-$k$ Ratio
  • Definition 2: Retrieval Precision
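
The text of Definition 2 is not reproduced on this page. As a hedged illustration only, one common way to measure the precision of an approximate Top-$k$ retriever is the fraction of the exact Top-$k$ Keys that the approximate selection recovers. The sketch below assumes that form; the function name `retrieval_precision` and the simulated "approximate" scores are hypothetical and may differ from the paper's definition.

```python
import numpy as np

def retrieval_precision(approx_idx, exact_idx):
    # Assumed definition: |approx ∩ exact| / |exact|, i.e. the share of the
    # exact Top-k Keys that the approximate retriever also selected.
    exact = set(int(i) for i in exact_idx)
    approx = set(int(i) for i in approx_idx)
    if not exact:
        raise ValueError("exact_idx must be non-empty")
    return len(approx & exact) / len(exact)

rng = np.random.default_rng(0)
scores = rng.standard_normal(1024)            # exact query-key scores
noisy = scores + 0.5 * rng.standard_normal(1024)  # stand-in for an approximate indexer

k = 64
exact_topk = np.argsort(scores)[-k:]
approx_topk = np.argsort(noisy)[-k:]
print("retrieval precision:", retrieval_precision(approx_topk, exact_topk))
```

Under this assumed definition, a precision of 1.0 means the approximate retriever selects exactly the same Keys as exact Top-$k$, and lower values mean more of the highest-scoring Keys are missed.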