Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing

Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang

TL;DR

This work tackles the high cost of the prefilling phase in long-context LLMs by showing that attention patterns are similar across heads and that this similarity is consistent across inputs. It introduces SharePrefill, a dynamic sparse-attention method that clusters heads offline to identify similar patterns and then shares patterns across heads online during inference, so that full attention is required for only a subset of heads. Empirical results on InfiniteBench and PG-19 show that SharePrefill matches or surpasses state-of-the-art speedups while achieving the best overall accuracy. The approach may extend to decoding and multi-modular systems, but it leaves open questions about the fundamental causes of head similarity and about large-scale deployment.

Abstract

Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. Because existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
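
The sharing idea described above can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation: the function names are hypothetical, masks are per token rather than per block, the cluster assignment is supplied by hand instead of coming from offline clustering, and the rule for turning dense scores into a pattern (keep the fewest keys whose probabilities reach a cumulative `mass`) is an assumption chosen for illustration. The first head seen in a cluster pays for dense attention and publishes its pattern; later heads in the same cluster reuse it.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(q, k):
    """Causal softmax(q k^T / sqrt(d)) for one head; q and k are lists of vectors."""
    d = len(q[0])
    scores = []
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        scores.append(softmax(logits) + [0.0] * (len(k) - i - 1))
    return scores

def pattern_from_scores(scores, mass=0.9):
    """Assumed rule: per query, keep the fewest keys whose probabilities sum to >= mass."""
    mask = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep, acc = set(), 0.0
        for j in order:
            if acc >= mass:
                break
            keep.add(j)
            acc += row[j]
        mask.append([j in keep for j in range(len(row))])
    return mask

def sparse_attention(q, k, v, mask):
    """Causal attention restricted to positions where mask is True."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        idx = [j for j in range(i + 1) if mask[i][j]] or [i]  # never leave a row empty
        logits = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in idx]
        probs = softmax(logits)
        out.append([sum(p * v[j][c] for p, j in zip(probs, idx))
                    for c in range(len(v[0]))])
    return out

def share_prefill_sketch(heads, cluster_of, mass=0.9):
    """heads: list of (q, k, v) per head; cluster_of: cluster id per head (assumed given)."""
    pivotal = {}       # cluster id -> shared pattern, filled during inference
    outputs, dense_heads = [], []
    for h, (q, k, v) in enumerate(heads):
        c = cluster_of[h]
        if c in pivotal:
            mask = pivotal[c]                      # reuse the cluster's pattern
        else:
            n = len(q)
            mask = [[True] * n for _ in range(n)]  # dense pattern for this head
            pivotal[c] = pattern_from_scores(attention_scores(q, k), mass)
            dense_heads.append(h)
        outputs.append(sparse_attention(q, k, v, mask))
    return outputs, dense_heads
```

With three heads assigned to clusters `[0, 0, 1]`, only heads 0 and 2 compute dense attention in this sketch; head 1 reuses the pattern published by head 0. The sketch yields no real speedup, since a practical version would operate on blocks inside a fused GPU kernel.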

Paper Structure

This paper contains 28 sections, 2 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of our method with baselines across different models, plotting latency (under 128K) against the average score on InfiniteBench.
  • Figure 2: Attention patterns of different heads and their similarity matrices across various tasks.
  • Figure 3: Overview of the proposed SharePrefill. Attention heads are clustered offline based on the similarity of their attention score maps to create a static head dictionary. During inference, each head retrieves its cluster index $C_i$. Pivotal patterns are shared if available; otherwise, a dense pattern is assigned. The sparse attention output $\boldsymbol{O}$ is computed using $\boldsymbol{M}$, and $\boldsymbol{\tilde{A}}$ updates the dynamic pivotal pattern dictionary.
  • Figure 4: Perplexity results on PG-19 (Rae et al., 2019) using different models and methods.
  • Figure 5: Latency comparison of different approaches across various context lengths using different models.
  • ...and 1 more figures