SpecAttn: Speculating Sparse Attention

Harsh Shah

TL;DR

SpecAttn tackles the quadratic complexity of self-attention by marrying speculative decoding with dynamic sparse attention in a training-free framework. It achieves this by (i) mapping draft-to-verifier layers via KL-divergence, (ii) selecting tokens with a sorting-free top-p nucleus mechanism, and (iii) pruning the KV-cache to focus on the most informative tokens, guided by draft attention. Empirical results on PG-19 with a TinyLlama draft and Llama-2 verifier show up to around 78% KV-cache reduction with modest perplexity increases (e.g., ~15% relative at p=0.95) and meaningful end-to-end speedups, outperforming existing sparse-attention approaches like Quest at similar sparsity. The work demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation, enabling more scalable inference for long-context LLMs.
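
Step (i) can be illustrated with a small sketch. It assumes each layer is summarized by one attention distribution over the same context tokens, and that each verifier layer is mapped to the draft layer with the smallest KL divergence. The direction of the divergence, the toy distributions, and the names `kl_divergence` and `map_layers` are illustrative assumptions, not details taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def map_layers(draft_attn, verifier_attn):
    """For each verifier layer, pick the draft layer with the smallest KL divergence."""
    mapping = []
    for v in verifier_attn:
        scores = [kl_divergence(v, d) for d in draft_attn]
        mapping.append(min(range(len(scores)), key=scores.__getitem__))
    return mapping

# Toy attention distributions (hypothetical): 2 draft layers, 3 verifier layers.
draft_attn = [
    [0.70, 0.20, 0.05, 0.05],   # peaked on the first token
    [0.25, 0.25, 0.25, 0.25],   # uniform
]
verifier_attn = [
    [0.65, 0.25, 0.05, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.30, 0.20, 0.25, 0.25],
]

layer_map = map_layers(draft_attn, verifier_attn)
print(layer_map)  # one draft-layer index per verifier layer
```

In this toy setup two verifier layers share the same draft layer, which matches the many-to-one mapping the paper describes.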

Abstract

Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
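
One way to select a top-p set without sorting is to binary-search a weight threshold: keep every token whose draft attention weight is at least the threshold, and search for the largest threshold whose kept mass still reaches p. The sketch below assumes that approach and is written on the CPU in plain Python; it is not the paper's GPU kernel, and `top_p_mask` and its `iters` parameter are hypothetical names.

```python
def top_p_mask(weights, p, iters=30):
    """Return a keep-mask whose kept attention mass is at least p, without sorting.

    Assumes `weights` is a normalized attention distribution and 0 < p <= 1.
    """
    lo, hi = 0.0, max(weights)  # threshold 0 keeps every token
    for _ in range(iters):
        mid = (lo + hi) / 2
        mass = sum(w for w in weights if w >= mid)  # mass kept at threshold mid
        if mass >= p:
            lo = mid   # still enough mass: try a stricter threshold
        else:
            hi = mid   # too little mass: relax the threshold
    return [w >= lo for w in weights]

# Toy draft attention weights for five context tokens (hypothetical values).
weights = [0.5, 0.05, 0.3, 0.1, 0.05]
mask = top_p_mask(weights, p=0.85)
print(mask)
```

Each iteration is a masked sum over all tokens, an operation that parallelizes well, and the number of iterations is fixed rather than dependent on context length. Tokens tied at the final threshold are all kept, so the selected set can be slightly larger than a sort-based nucleus.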

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 2 tables, 3 algorithms.

Figures (5)

  • Figure 1: SpecAttn framework. Layer-specific tokens are selected dynamically at runtime from the draft model to prune the KV cache of the target model. Note that one draft-model layer can be mapped to multiple verifier-model layers. The sorting shown here is for illustration only; the implementation described in the paper performs the selection without sorting (see the sorting-free nucleus algorithm).
  • Figure 2: Heatmap of KL divergence between layers of TinyLlama-1.1B as the small (draft) model and Llama-2-7b-hf as the large (verifier) model. Red boxes denote the draft layer selected for the corresponding verifier layer.
  • Figure 3: Time taken for mask generation, comparing sorting in PyTorch with the sorting-free nucleus algorithm.
  • Figure 4: Speedup achieved in computing attention through FlashInfer's BlockSparseAttention with p=0.97 as compared to p=1.0 (i.e., full attention)
  • Figure 5: Perplexity (lower is better) comparison across different sparse attention methods. Here Baseline refers to the vanilla full attention decoding (StreamingLLM is omitted due to relatively high perplexity).
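
The pruning step behind the sparsity and perplexity comparisons above can be sketched as attention computed only over the tokens a keep-mask retains. This is a single-query, single-head toy in plain Python with made-up keys, values, and mask; the paper uses block-sparse attention kernels instead, and `attention` and `keep` are hypothetical names.

```python
import math

def attention(query, keys, values, keep=None):
    """Softmax attention for one query, restricted to tokens where keep[i] is True."""
    idx = [i for i in range(len(keys)) if keep is None or keep[i]]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in idx]
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    return [sum(e / z * values[i][d] for e, i in zip(exps, idx)) for d in range(dim)]

# Toy KV cache with four tokens (hypothetical values).
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [-4.0, 0.0], [3.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
keep = [True, True, False, True]             # mask predicted from draft attention

full = attention(query, keys, values)
pruned = attention(query, keys, values, keep)
reduction = 1 - sum(keep) / len(keep)        # fraction of KV entries skipped
max_err = max(abs(a - b) for a, b in zip(full, pruned))
print(reduction, max_err)
```

Here the dropped token carries very little attention weight, so skipping it changes the output only slightly while reading a quarter fewer KV entries. The quality cost in practice depends on how well the draft attention predicts the verifier's.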