Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing
Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang
TL;DR
This work tackles the high cost of the prefilling phase in long-context LLMs by showing that attention patterns are similar across heads and that this similarity is consistent across inputs. It introduces SharePrefill, a dynamic sparse-attention method that clusters heads offline to identify similar patterns and shares those patterns across heads online during inference, so that full attention is required for only a subset of heads. Empirical results on InfiniteBench and PG-19 show that SharePrefill matches or surpasses state-of-the-art speedups while achieving the best overall accuracy, demonstrating practical gains for long-context inference. The approach promises scalable acceleration with potential extensions to decoding and multi-modular systems, while leaving open questions about the fundamental causes of head similarity and about large-scale deployment.
Abstract
Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. Existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, and as a result often fail to capture the true dynamics of attention, which reduces efficiency and compromises accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing accurately computed patterns across attention heads, our method captures the actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
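The idea of sharing a computed pattern across similar heads can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the block size, the cumulative-mass rule used to turn a full attention map into a block pattern, and the assumption that head clusters are already given (for example, from an offline clustering step) are all hypothetical choices made for illustration. The sketch also applies the pattern as a dense mask, so it shows the semantics of pattern sharing rather than any actual speedup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_scores(q, k):
    """Scaled dot-product scores with a causal mask; q and k are (n, d)."""
    n, d = q.shape
    s = q @ k.T / np.sqrt(d)
    return np.where(np.tril(np.ones((n, n), dtype=bool)), s, -np.inf)

def block_pattern(probs, block, mass):
    """Hypothetical rule: per block row, keep the highest-mass blocks until
    a fraction `mass` of that row's attention mass is covered."""
    nb = probs.shape[0] // block
    bm = probs.reshape(nb, block, nb, block).sum(axis=(1, 3))
    order = np.argsort(-bm, axis=1)
    sorted_mass = np.take_along_axis(bm, order, axis=1)
    cum = np.cumsum(sorted_mass, axis=1)
    keep_sorted = (cum - sorted_mass) < mass * cum[:, -1:]
    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=1)
    return keep | np.eye(nb, dtype=bool)  # always keep the diagonal block

def sparse_attention(q, k, v, pattern, block):
    """Attention restricted to the blocks selected in `pattern`."""
    token_mask = np.repeat(np.repeat(pattern, block, axis=0), block, axis=1)
    s = np.where(token_mask, causal_scores(q, k), -np.inf)
    return softmax(s) @ v

def shared_pattern_attention(Q, K, V, clusters, block=4, mass=0.95):
    """Q, K, V are (heads, n, d); clusters is a list of lists of head indices.
    Only the first head of each cluster gets full attention; the remaining
    heads in that cluster reuse its block pattern."""
    out = np.empty_like(V)
    for heads in clusters:
        pivot = heads[0]
        probs = softmax(causal_scores(Q[pivot], K[pivot]))
        out[pivot] = probs @ V[pivot]
        pattern = block_pattern(probs, block, mass)
        for h in heads[1:]:
            out[h] = sparse_attention(Q[h], K[h], V[h], pattern, block)
    return out
```

In this sketch the sequence length must be a multiple of `block`. Lowering `mass` yields sparser patterns at the cost of a larger deviation from full attention for the heads that reuse a shared pattern.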
