Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration

Arundhathi Dev, Justin Zhan

Abstract

Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, and identifies high-sparsity configurations that outperform existing sparse attention baselines while closely matching dense attention quality. By transforming sparse attention from a manually tuned heuristic into a self-optimizing primitive, AFBS-BO enables plug-and-play acceleration across diverse transformer architectures and domains.

Paper Structure

This paper contains 35 sections, 7 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The AFBS-BO Framework Architecture. The automated tuning pipeline consists of three sequential stages: (1) Global Exploration utilizes Bayesian Optimization on low-fidelity surrogates to identify promising regions; (2) Local Refinement employs high-fidelity binary search for precision; and (3) Robust Validation ensures stability across inputs. The resulting layer-specific hyperparameters are injected into the kernel for plug-and-play acceleration.
  • Figure 2: Context Length Stability. Unlike Window Attention, which degrades catastrophically as sequence length exceeds the window size ($>$4K tokens), AFBS-BO maintains stable perplexity up to 32K tokens, tracking the Dense Oracle within 0.3 PPL.
  • Figure 3: KV Cache Memory Scaling with Sequence Length. Dense attention (dashed line) hits the 16GB GPU memory ceiling at approximately 12K tokens, while AFBS-BO's sparse attention (solid line) scales sub-linearly, enabling processing of 32K+ token sequences on consumer GPUs. The 3.4$\times$ memory reduction demonstrated here translates directly to practical longer-context deployment.
  • Figure 4: Impact of Block Size ($B$) on Quality vs. Efficiency. We analyze the trade-off between semantic resolution (Perplexity, blue dashed) and inference throughput (Speed, red solid). Small blocks ($B < 64$) incur significant kernel overhead for marginal quality gains. Conversely, coarse blocks ($B > 64$) maximize speed but cause a sharp degradation in perplexity due to context aliasing. Our chosen block size of $B=64$ represents the optimal Pareto point, achieving near-peak throughput while maintaining model quality within the acceptable tolerance zone.
  • Figure 5: Optimization Convergence. AFBS-BO (Blue) rapidly reduces approximation error using Bayesian exploration, whereas Random Search (Grey) stagnates. The vertical drop at iteration 15 highlights the efficacy of Stage 2 (Binary Refinement) in identifying the precise optimal threshold.
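
The coarse-then-fine search described in the captions for Figures 1 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single scalar sparsity threshold in [0, 1], where a larger value means more aggressive sparsity, and an approximation error that grows monotonically with that threshold. The names `toy_error`, `coarse_bracket`, `binary_refine`, and `tune_threshold` are hypothetical, and `toy_error` is a synthetic stand-in for a real attention-error measurement. To stay dependency-free, Stage 1 uses a uniform scan at a short sequence length in place of Bayesian Optimization, and Stage 3 (robust validation) is omitted.

```python
from typing import Callable, Tuple

# Signature assumed for illustration: (threshold, sequence_length) -> error.
ErrorFn = Callable[[float, int], float]


def toy_error(threshold: float, seq_len: int) -> float:
    """Synthetic error model (hypothetical): monotonically increasing in the
    threshold, and slightly pessimistic at short sequence lengths."""
    return threshold ** 2 * (1.0 + 256.0 / seq_len)


def coarse_bracket(error_fn: ErrorFn, tolerance: float, seq_len: int,
                   num_probes: int = 5) -> Tuple[float, float]:
    """Stage 1 stand-in: evenly spaced low-fidelity probes that bracket the
    largest acceptable threshold. The paper uses Bayesian Optimization here;
    a uniform scan is substituted to keep the sketch short."""
    lo, hi = 0.0, 1.0
    for i in range(1, num_probes + 1):
        t = i / (num_probes + 1)
        if error_fn(t, seq_len) <= tolerance:
            lo = t          # still within tolerance: raise the lower bound
        else:
            hi = t          # first failure: this is the upper bound
            break
    return lo, hi


def binary_refine(error_fn: ErrorFn, tolerance: float, seq_len: int,
                  lo: float, hi: float, iters: int = 20) -> float:
    """Stage 2: high-fidelity binary search inside the bracket. Relies on the
    monotonicity assumption stated above."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if error_fn(mid, seq_len) <= tolerance:
            lo = mid
        else:
            hi = mid
    return lo  # largest threshold found that satisfies the tolerance


def tune_threshold(error_fn: ErrorFn, tolerance: float,
                   low_len: int = 512, high_len: int = 4096) -> float:
    lo, hi = coarse_bracket(error_fn, tolerance, low_len)
    # The low-fidelity bracket may not hold at high fidelity; widen if needed.
    if error_fn(lo, high_len) > tolerance:
        lo = 0.0
    if error_fn(hi, high_len) <= tolerance:
        hi = 1.0
    return binary_refine(error_fn, tolerance, high_len, lo, hi)


best = tune_threshold(toy_error, tolerance=0.05)
print(f"selected threshold: {best:.4f}")
```

In this toy setting the cheap probes cost at most five short-sequence evaluations, after which each binary step halves the remaining interval at full fidelity. The re-check of the bracket endpoints before refinement is a design choice of this sketch, guarding against the case where short-sequence behavior does not transfer to longer sequences.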