Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs

Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao

TL;DR

This work targets the heavy cost of attention over large KV caches in long-context LLMs. It introduces Double-P, a hierarchical top-p sparse attention framework that first performs cluster-level attention estimation using size-weighted centroids, then adapts token-level computation through a second top-p stage to allocate exact attention only where it matters, producing a mixture of exact and centroid-based contributions. The approach is implemented with GPU-friendly kernels and a fused attention pipeline, maintaining a bound on attention mass while reducing estimation and sparse-attention costs. Empirical results on LLaMA-3.1-8B and Qwen-3-8B across RULER and LongBench show near-zero accuracy loss with substantial speedups, including up to 1.78x attention-level speedup and 1.26x end-to-end decoding speedup compared with strong baselines, highlighting practical scalability for long-context inference.

Abstract

As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reduces attention computation overhead by up to 1.8x, and delivers up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
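The two-stage idea described in the abstract can be illustrated with a small, single-query sketch in plain Python. This is not the authors' implementation: the function names (`top_p_indices`, `double_p_attention`), the thresholds `p1` and `p2`, the use of mean keys and values as centroids, and the exact way centroid and token terms are mixed are all assumptions made for illustration. The sketch estimates each cluster's attention mass as its size times the exponentiated centroid score, keeps the clusters that cover a fraction `p1` of that estimate, computes exact token attention only for the clusters covering `p2` of the kept mass, and lets the remaining kept clusters contribute through their centroids.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_p_indices(weights, p):
    """Smallest set of indices, taken in descending weight order,
    whose share of the total weight reaches p."""
    total = sum(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += weights[i]
        if mass >= p * total:
            break
    return chosen

def double_p_attention(q, keys, values, clusters, p1=0.95, p2=0.8):
    """Hypothetical two-stage top-p attention for one query vector.
    `clusters` is a list of lists of token indices (assumed precomputed)."""
    scale = 1.0 / math.sqrt(len(q))
    cen_k = [mean_vec([keys[i] for i in c]) for c in clusters]
    cen_v = [mean_vec([values[i] for i in c]) for c in clusters]
    cen_s = [dot(q, k) * scale for k in cen_k]

    # Stage 1: size-weighted centroid estimate of each cluster's attention mass.
    m = max(cen_s)
    est = [len(c) * math.exp(s - m) for c, s in zip(clusters, cen_s)]
    kept = top_p_indices(est, p1)

    # Stage 2: among kept clusters, pick those that get exact token attention.
    exact = {kept[j] for j in top_p_indices([est[c] for c in kept], p2)}
    approx = [c for c in kept if c not in exact]

    # Exact scores for tokens in the refined clusters.
    tok_s = {i: dot(q, keys[i]) * scale for c in exact for i in clusters[c]}

    # Mix exact token terms with centroid terms under one shared normalizer.
    shift = max(list(tok_s.values()) + [cen_s[c] for c in approx])
    num, den = [0.0] * len(values[0]), 0.0
    for i, s in tok_s.items():
        w = math.exp(s - shift)
        den += w
        num = [n + w * v for n, v in zip(num, values[i])]
    for c in approx:
        w = len(clusters[c]) * math.exp(cen_s[c] - shift)
        den += w
        num = [n + w * v for n, v in zip(num, cen_v[c])]
    return [n / den for n in num], kept, sorted(exact)
```

In this sketch, setting `p1 = p2 = 1.0` keeps every cluster and refines all of them, so the result reduces to dense softmax attention; lowering either threshold trades exactness for fewer score computations.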

Paper Structure (20 sections, 9 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Average accuracy and decode latency of different sparse attention methods tested on RULER (Hsieh et al., 2024) at 32k context length. Double-P dominates existing approaches, forming a Pareto frontier that demonstrates superior efficiency–accuracy balance.
  • Figure 2: Limitations of fixed token budgets for sparse attention.
  • Figure 3: Failure of fixed-budget token-level Top-P estimation. Each point shows the recovered attention mass and its budget ratio for an individual LongBench sample. Under the recommended fixed budget, 91.9% of samples fail to recover sufficient attention mass.
  • Figure 4: Latency breakdown of token-level top-p sparse attention. Shaded bars indicate top-p estimation overhead. Token-level attention score estimation (SpGEMV) and top-p selection account for a substantial fraction of total latency, indicating that top-p estimation overhead dominates latency even before sparse attention (SA) is applied.
  • Figure 5: Double-P framework overview. Double-P performs hierarchical top-p sparse attention by first estimating attention mass at the cluster level using size-weighted centroids, then adaptively refining token-level computation through a second top-p stage. Exact token attention and centroid-based approximations are combined to achieve high accuracy with low estimation and sparse attention cost.
  • ...and 5 more figures
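
The limitation of fixed token budgets noted in the captions for Figures 2 and 3 can be seen on a toy example. The sketch below is illustrative only: the two synthetic attention distributions, the budget `k = 8`, the threshold `p = 0.9`, and the helper names are assumptions, not data or code from the paper. It compares how much attention mass a fixed top-k budget recovers on a peaked versus a flat distribution, and how many tokens top-p needs to reach the same target mass on each.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_mass(probs, k):
    """Attention mass recovered by keeping the k largest probabilities."""
    return sum(sorted(probs, reverse=True)[:k])

def top_p_budget(probs, p):
    """Number of tokens needed for the kept mass to reach p."""
    mass = 0.0
    for n, w in enumerate(sorted(probs, reverse=True), start=1):
        mass += w
        if mass >= p:
            return n
    return len(probs)

# Two synthetic heads over 64 cached tokens (hypothetical data).
peaked = softmax([8.0] + [0.0] * 63)   # one token holds most of the mass
flat = softmax([0.0] * 64)             # mass spread evenly

for name, probs in [("peaked", peaked), ("flat", flat)]:
    print(name,
          "top-8 mass:", round(top_k_mass(probs, 8), 3),
          "tokens for p=0.9:", top_p_budget(probs, 0.9))
```

With the same budget of 8 tokens, the peaked head recovers almost all of its attention mass while the flat head recovers only one eighth of it; top-p instead spends a single token on the peaked head and most of the cache on the flat one, which is the adaptivity that a fixed budget lacks.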