Accelerating Attention with Basis Decomposition
Jialin Zhao
TL;DR
BD Attention (BDA) is a lossless algorithmic reformulation of multi-head attention based on Basis Decomposition (BD): weight products are expressed in BD form, which reduces parameters and computation while preserving exact outputs. An offline preparation step decomposes projection products (e.g., $W_q^i W_k^{i\top}$), and inference then uses the resulting compact projections with aligned bases across heads, ensuring $Q'_i (K'_i)^{\top} = Q_i K_i^{\top}$. Empirical results show 1.32×–1.34× speedups in the key/value projections on GPUs, ~25% fewer parameters, and negligible perplexity increases, with additional gains when combined with low-rank pruning. BDA is architecture-agnostic and complementary to existing optimizations such as FlashAttention, offering a theoretically grounded and practical route to lossless attention acceleration in both inference and training. The work also demonstrates BD’s compatibility with training dynamics and low-rank compression, suggesting broad applicability across LLMs and VLMs.
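To make the identity $Q'_i (K'_i)^{\top} = Q_i K_i^{\top}$ concrete, here is one way it can be realized. This is an illustrative sketch, not necessarily the paper's exact construction: the symbols $X$, $U_1$, $U_2$, $C$, $d$, and $d_h$ are introduced here, and the leading $d_h \times d_h$ block of $W_q^i$ is assumed to be invertible. Write $X \in \mathbb{R}^{n \times d}$ for the input and $W_q^i, W_k^i \in \mathbb{R}^{d \times d_h}$ for one head's projections, so that $Q_i = X W_q^i$ and $K_i = X W_k^i$.

$$
W_q^i = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix},
\qquad
C = U_2 U_1^{-1},
\qquad
W_q^i W_k^{i\top} = \begin{bmatrix} I \\ C \end{bmatrix} \left( U_1 W_k^{i\top} \right)
$$

$$
Q'_i = X \begin{bmatrix} I \\ C \end{bmatrix},
\qquad
K'_i = X W_k^i U_1^{\top},
\qquad
Q'_i (K'_i)^{\top} = X \begin{bmatrix} I \\ C \end{bmatrix} U_1 W_k^{i\top} X^{\top} = X W_q^i W_k^{i\top} X^{\top} = Q_i K_i^{\top}
$$

Under these assumptions the identity block needs no storage or multiplication, so the query side keeps only $C$, which has $(d - d_h)\, d_h$ entries instead of $d\, d_h$, a saving of $d_h^2$ parameters per head.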
Abstract
Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.
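The kind of matrix identity the abstract refers to can be checked numerically. The NumPy sketch below is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the helper names `bd_decompose_qk` and `bd_scores` are hypothetical, the dimensions are arbitrary, and the top `d_head × d_head` block of `W_q` is assumed to be invertible. It rewrites one head's query/key projections so that the query side stores only a coefficient matrix `C`, then confirms that the attention scores are unchanged up to floating-point rounding.

```python
import numpy as np


def bd_decompose_qk(W_q, W_k):
    """Hypothetical helper: rewrite one head's (W_q, W_k) in a basis-decomposed form.

    W_q, W_k have shape (d_model, d_head). Assumes the top d_head x d_head
    block of W_q is invertible. Returns (C, W_k_new) such that
    [I; C] @ W_k_new.T equals W_q @ W_k.T in exact arithmetic.
    """
    d_head = W_q.shape[1]
    U1, U2 = W_q[:d_head], W_q[d_head:]
    # C = U2 @ inv(U1), obtained by solving U1.T @ C.T = U2.T (no explicit inverse).
    C = np.linalg.solve(U1.T, U2.T).T
    # Fold U1 into the key projection so the query side can use [I; C].
    W_k_new = W_k @ U1.T
    return C, W_k_new


def bd_scores(X, C, W_k_new):
    """Attention logits computed from the decomposed weights."""
    d_head = C.shape[1]
    # X @ [I; C]: the identity block is a plain slice, so only C is multiplied.
    Q_new = X[:, :d_head] + X[:, d_head:] @ C
    K_new = X @ W_k_new
    return Q_new @ K_new.T


# Illustrative sizes only; they are not taken from the paper.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 8, 64, 16
X = rng.standard_normal((n_tokens, d_model))
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

scores_ref = (X @ W_q) @ (X @ W_k).T
C, W_k_new = bd_decompose_qk(W_q, W_k)
scores_bd = bd_scores(X, C, W_k_new)

params_before = W_q.size + W_k.size
params_after = C.size + W_k_new.size

print("max abs difference:", np.max(np.abs(scores_ref - scores_bd)))
print("parameters before/after:", params_before, params_after)
```

In exact arithmetic the two score matrices are identical; in floating point they differ only by rounding error, which is consistent with the small but nonzero perplexity changes reported for FP16 and FP32. In this sketch the saving is `d_head ** 2` parameters per head on the query side.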
