ProxyAttn: Guided Sparse Attention via Representative Heads

Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che

Abstract

The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text prefilling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging this fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. Specifically, ProxyAttn achieves up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.
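
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation (see the repository above for that). It assumes mean pooling of query and key heads into proxy heads, an equal number of query and key heads, block scores obtained by summing token-level attention mass, and per-head budgets supplied by the caller rather than estimated online. All function names and parameters (`pool_heads`, `proxy_masks`, `group_size`, `budgets`) are hypothetical and chosen for illustration only.

```python
import math
import random

def pool_heads(heads, group_size):
    """Average each consecutive group of heads into one proxy head (assumed pooling)."""
    pooled = []
    for g in range(0, len(heads), group_size):
        group = heads[g:g + group_size]
        n_tokens, dim = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(h[t][d] for h in group) / len(group) for d in range(dim)]
            for t in range(n_tokens)
        ])
    return pooled

def causal_attention(q, k):
    """Token-level causal softmax attention for one head; returns n rows of length n."""
    scale = 1.0 / math.sqrt(len(q[0]))
    rows = []
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        rows.append([e / z for e in exps] + [0.0] * (len(q) - i - 1))
    return rows

def block_scores(attn, block):
    """Sum token-level attention mass inside each (query block, key block) tile."""
    n_blocks = math.ceil(len(attn) / block)
    scores = [[0.0] * n_blocks for _ in range(n_blocks)]
    for i, row in enumerate(attn):
        for j, p in enumerate(row):
            scores[i // block][j // block] += p
    return scores

def select_blocks(scores, budget):
    """For each query block, keep the `budget` highest-scoring causal key blocks."""
    mask = []
    for qb, row in enumerate(scores):
        causal = range(qb + 1)
        keep = sorted(causal, key=lambda kb: row[kb], reverse=True)[:budget]
        mask.append(sorted(keep))
    return mask

def proxy_masks(q_heads, k_heads, group_size, block, budgets):
    """Score blocks once per proxy head, then cut a mask per original head."""
    q_proxy = pool_heads(q_heads, group_size)
    k_proxy = pool_heads(k_heads, group_size)
    shared = [block_scores(causal_attention(q, k), block)
              for q, k in zip(q_proxy, k_proxy)]
    # Heads in the same group reuse one score map but apply their own budget.
    return [select_blocks(shared[h // group_size], budgets[h])
            for h in range(len(q_heads))]

# Toy usage: 4 heads, 8 tokens, head dimension 4, two heads per proxy head.
rng = random.Random(0)

def rand_heads(n_heads, n_tokens, dim):
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
            for _ in range(n_heads)]

q_heads = rand_heads(4, 8, 4)
k_heads = rand_heads(4, 8, 4)
budgets = [1, 2, 3, 4]  # hypothetical per-head block budgets
masks = proxy_masks(q_heads, k_heads, group_size=2, block=2, budgets=budgets)
print(masks[0])  # kept key-block indices per query block for head 0
```

In this sketch the expensive token-level scoring runs once per proxy head rather than once per original head, while the per-head budgets still let heads that share a proxy end up with different block masks.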

Paper Structure

This paper contains 37 sections, 4 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance and speedup results of different sparse attention methods on RULER.
  • Figure 2: Observational study on Llama3.1-8B-Instruct using 8K-token data from the Needle in a Haystack (NIAH) Single-1 task of the RULER dataset (Hsieh et al., 2024).
  • Figure 3: Illustration of ProxyAttn. By compressing the head dimension, ProxyAttn can obtain token-level importance scores, leading to more accurate block importance estimation. A proxy head is able to obtain diverse masks by leveraging the online budget estimations from different heads.
  • Figure 4: Kernel-level speedup of existing sparse attention methods with varying input lengths.
  • Figure 5: Experimental analysis of the proposed method. (a) Performance with different numbers of proxy attention heads across various models. (b) Latency of estimating block importance with 128K inputs using different methods. For all methods, this latency is less than 10% of the Full Attention latency. (c) Performance degradation with increasing sparsity rate for different budget allocation methods.
  • ...and 3 more figures