DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

TL;DR

DELTA tackles the decoding-time bottleneck in long-context reasoning by introducing a training-free, layer-aware sparse attention mechanism. It uses a three-tier design with early full-attention layers, Delta layers that recompute full attention to refresh a small set of salient tokens, and subsequent sparse layers that attend only to that subset, preserving the full KV cache. Through Unified Head Selection and a Stable Recency Window, DELTA maintains high recall with far fewer attended tokens, achieving up to a $1.5\times$ end-to-end speedup while matching or exceeding full-attention accuracy on difficult benchmarks. This approach offers a practical path to efficient long-context reasoning without retraining, with potential impact on latency and resource usage in real-world LLM serving.

Abstract

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. Existing sparse attention methods reduce computation by pruning the key-value (KV) cache, yet they suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the dynamic importance of tokens over long derivations. We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of selection layers that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $5\times$ and delivering $1.5\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
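
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the aggregation rule (summing per-head softmax probabilities), the function names `select_salient_tokens` and `sparse_attention`, and the parameters `budget` and `recency_window` are assumptions made for illustration. The sketch only shows the general idea of a selection layer choosing one shared token subset from head-level scores, and a later layer attending to that subset while the full KV cache stays in place.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_salient_tokens(head_logits, budget, recency_window):
    """Pick one shared set of token indices from a selection layer.

    head_logits: one list of attention logits per head, each covering
    every cached token for the current query.
    Assumption: heads are aggregated by summing softmax probabilities.
    """
    num_tokens = len(head_logits[0])
    agg = [0.0] * num_tokens
    for logits in head_logits:
        for i, p in enumerate(softmax(logits)):
            agg[i] += p

    # Always keep the most recent tokens (hypothetical recency rule).
    recent = set(range(max(0, num_tokens - recency_window), num_tokens))
    remaining = max(0, budget - len(recent))

    # Fill the rest of the budget with the highest-scoring older tokens.
    older = [i for i in range(num_tokens) if i not in recent]
    older.sort(key=lambda i: agg[i], reverse=True)
    return sorted(recent | set(older[:remaining]))


def sparse_attention(query, keys, values, selected):
    """Single-head attention restricted to the selected token indices.

    keys and values hold the full cache; only `selected` rows are read.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i]))
              for i in selected]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected))
            for d in range(dim)]


# Two heads, six cached tokens: head 0 favors token 0, head 1 favors token 2.
head_logits = [
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0, 0.0, 0.0],
]
selected = select_salient_tokens(head_logits, budget=4, recency_window=2)
```

Because nothing is evicted from the cache in this sketch, a later selection layer can pick a different subset when token importance shifts, which is the property the abstract contrasts with cache-pruning methods.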

Paper Structure

This paper contains 13 sections, 14 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: (Left) Attention maps from Qwen-7B at decoding steps 900 and 1000, where each row corresponds to a Transformer layer. (Right) Decoding runtime of FFN and attention modules across generation, showing attention’s linear growth with context length.
  • Figure 2: Overview of the DELTA decoding process. The first two layers perform full attention for initialization, $\Delta$-layers (e.g., Layers 2 and 14) run full attention to select salient tokens, and subsequent sparse attention layers attend only to those selected tokens, as indicated by green arrows.
  • Figure 3: Accuracy of sparse attention methods on reasoning benchmarks using Qwen-7B and 14B models. DELTA consistently matches or exceeds the accuracy of Full attention under limited token budgets and maintains robustness across different datasets.
  • Figure 4: (Left) CDF of decoding rounds across model-dataset pairs. DELTA reaches high CDF values faster than baselines by maintaining shorter generation lengths. (Right) End-to-end forward latency per decoding round. After DELTA activation (gray line), latency becomes lower than full attention.
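
The layer assignment in the Figure 2 caption can be turned into a rough count of attended tokens per decoding step. The sketch below is a back-of-the-envelope calculation under stated assumptions, not a result from the paper: the layer count (28), context length, and token budget are hypothetical values chosen for illustration, and the helper names are invented here. Full-attention and $\Delta$-layers are assumed to read every cached token, while sparse layers read at most `budget` tokens.

```python
def layer_roles(num_layers, num_full, delta_layers):
    """Label each layer as 'full', 'delta', or 'sparse'."""
    roles = []
    for layer in range(num_layers):
        if layer < num_full:
            roles.append("full")
        elif layer in delta_layers:
            roles.append("delta")
        else:
            roles.append("sparse")
    return roles


def attended_tokens_per_step(roles, context_len, budget):
    """Total tokens read across all layers for one decoded token."""
    sparse_len = min(budget, context_len)
    return sum(context_len if role in ("full", "delta") else sparse_len
               for role in roles)


def reduction_factor(roles, context_len, budget):
    """Ratio of full-attention token reads to the layered scheme's reads."""
    full_cost = len(roles) * context_len
    return full_cost / attended_tokens_per_step(roles, context_len, budget)


# Hypothetical configuration: 28 layers, first two full, Delta at 2 and 14.
roles = layer_roles(num_layers=28, num_full=2, delta_layers={2, 14})
attended = attended_tokens_per_step(roles, context_len=8192, budget=1024)
factor = reduction_factor(roles, context_len=8192, budget=1024)
```

Under these illustrative numbers, 4 layers read all 8192 tokens and 24 layers read 1024 tokens each, so the scheme reads a quarter of what full attention would. The reduction grows with context length because the sparse layers' cost stays fixed at the budget.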