Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies

Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang

TL;DR

The paper addresses the memory and computation bottlenecks of processing long sequences in transformer models by proposing Alternating Sparse Attention (ASA), which alternates local sliding-window attention and global compressed/selective attention across layers. ASA replaces the GQA mechanism used in NSA with Multi-head Latent Attention (MLA) in the local (sliding-window) branch and Group-head Latent Attention (GLA) in the compression and selective branches, and adds kernel optimizations to improve hardware efficiency. Empirical results on 340M and 1.3B parameter models trained on 15B and 100B tokens show that ASA can match or surpass full attention and NSA on common-sense reasoning, in-context retrieval, and long-context understanding, while reducing KV-cache storage by 50% relative to NSA. These findings point to a scalable, memory-efficient approach to long-context modeling with practical implications for large-scale language models.

Abstract

In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. We further refine NSA's branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model's common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show that our method matches or exceeds full attention and native sparse attention on both common-sense reasoning and long-context understanding tasks.
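
The layer-wise alternation described above can be illustrated with a small sketch. It is not the authors' implementation: the function names `layer_schedule` and `sliding_window_mask`, the choice of which branch comes first, and the window size are all assumptions made for illustration. The sketch only shows how consecutive layers could be assigned a local or global attention type, and what a causal sliding-window mask looks like for the local layers.

```python
from typing import List


def layer_schedule(num_layers: int, first: str = "local") -> List[str]:
    """Assign an attention type to each layer, alternating local and global.

    Which type comes first is an assumption for illustration only.
    """
    kinds = ("local", "global") if first == "local" else ("global", "local")
    return [kinds[i % 2] for i in range(num_layers)]


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal sliding-window mask.

    mask[q][k] is True when query position q may attend to key position k,
    i.e. k is not in the future and lies within the last `window` positions.
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Hypothetical 6-layer model and a toy 5-token sequence with window size 2.
schedule = layer_schedule(6)
mask = sliding_window_mask(5, 2)
```

In this toy setup, local layers restrict each query to a short recent window, while global layers (compression and selection, not sketched here) are left to carry long-range information between them.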

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of Llama-like Models Incorporating Full Attention, Native Sparse Attention, and Alternating Sparse Attention.
  • Figure 2: Overview of ASA’s architecture. In ASA, consecutive attention layers alternate between compressed selective attention and sliding window attention. Furthermore, GLA and MLA replace the GQA mechanism used in NSA to enhance model expressiveness. To improve training efficiency, every fourth token in the compressed selective attention module inherits the block index of the first token in its block.
  • Figure 3: Computation time comparison for the forward pass.