Accelerating Attention with Basis Decomposition

Jialin Zhao

TL;DR

BD Attention (BDA) is a lossless algorithmic reformulation of multi-head attention based on Basis Decomposition (BD): it rewrites products of projection weights (e.g., $W_q^i W_k^{i\top}$) in a compact BD form that reduces parameters and computation while preserving exact outputs. The decomposition is computed once, offline; at inference time the transformed projections, with bases aligned across heads, satisfy $Q'_i (K'_i)^{\top} = Q_i K_i^{\top}$ for each head $i$. On GPUs, the key projection operator runs on average 1.32× (FP16) to 1.34× (BF16) faster, weights shrink by 25%, and the perplexity increase is negligible, with additional gains when BDA is combined with low-rank pruning. BDA is architecture-agnostic and complementary to existing optimizations such as FlashAttention, offering a theoretically grounded and practical route to lossless attention acceleration in both inference and training. The work also demonstrates BD's compatibility with training dynamics and low-rank compression, suggesting broad applicability across LLMs and VLMs.

Abstract

Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.
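
The exactness claim can be checked numerically on a single head. The sketch below is a minimal illustration under stated assumptions, not the paper's algorithm: it takes the first $d_h$ rows of $W_k$ as the basis (the paper's basis selection may differ), and the names `bd_prepare` and `bd_scores` as well as the sizes are hypothetical.

```python
import numpy as np

def bd_prepare(W_q, W_k):
    """Offline step (illustrative): fold the key basis into the query weight."""
    d_h = W_q.shape[1]
    A = W_k[:d_h]                    # (d_h, d_h) basis block, assumed invertible
    R = W_k[d_h:]                    # (d - d_h, d_h) remaining rows
    C = np.linalg.solve(A.T, R.T).T  # C = R @ inv(A), so R = C @ A
    B = W_q @ A.T                    # (d, d_h) new query-side weight
    return B, C

def bd_scores(X, B, C):
    """Inference step (illustrative): attention scores from the BD form."""
    d_h = B.shape[1]
    Q_new = X @ B                          # same cost as the original q projection
    K_new = X[:, :d_h] + X[:, d_h:] @ C    # identity block costs no multiplications
    return Q_new @ K_new.T

rng = np.random.default_rng(0)
n, d, d_h = 8, 64, 16                # hypothetical sizes with d_h / d = 0.25
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d_h))
W_k = rng.standard_normal((d, d_h))

B, C = bd_prepare(W_q, W_k)
scores_ref = (X @ W_q) @ (X @ W_k).T  # standard Q K^T for one head
scores_bd = bd_scores(X, B, C)

print(np.allclose(scores_bd, scores_ref, rtol=1e-6, atol=1e-6))
print(C.size / W_k.size)              # fraction of key-side weights kept
```

In this sketch the key side multiplies by a $(d - d_h) \times d_h$ matrix instead of a $d \times d_h$ one, which is where a $d_h / d$ saving would come from under these assumptions.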

Paper Structure

This paper contains 28 sections, 2 theorems, 20 equations, 3 figures, 7 tables, 5 algorithms.

Key Result

Theorem 3.1

Let $\mathbf{W}$ be an $r \times r$ real random matrix. Suppose the entries of $\mathbf{W}$ are drawn from a probability measure $\mu$ on $\mathbb{R}^{r^2}$ that is absolutely continuous with respect to the Lebesgue measure $\lambda$. Then $\mathbf{W}$ has full rank ($\mathrm{rank}(\mathbf{W}) = r$) with probability 1.
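
A quick numerical illustration of the theorem, offered as a sanity check rather than a proof: the standard Gaussian is one example of an absolutely continuous distribution, so sampled square matrices should come out full rank, whereas a matrix with a repeated row (a measure-zero event) should not. The size `r` and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 16  # arbitrary matrix size for the illustration

# Entries drawn i.i.d. from a standard Gaussian (absolutely continuous).
ranks = [np.linalg.matrix_rank(rng.standard_normal((r, r))) for _ in range(200)]
all_full = all(k == r for k in ranks)

# A rank-deficient matrix: copy one row onto another.
W_bad = rng.standard_normal((r, r))
W_bad[1] = W_bad[0]
rank_bad = np.linalg.matrix_rank(W_bad)

print(all_full, rank_bad)
```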

Figures (3)

  • Figure 1: Illustration of BD Attention (BDA) using the QK projection as an example (VO is analogous). BDA consists of two stages: (a) BD Attention Preparation (algorithm label: alg:mha_bd_prepare), performed offline once during model deployment, where the projection matrices are transformed via Basis Decomposition; (b) BD Attention Inference (algorithm label: alg:mha_bd_inference), which saves a fraction $d_h / d$ of both parameters and computation while preserving exact outputs.
  • Figure 2: Evaluation of BD Attention (BDA). (a) End-to-end accuracy: Perplexity ($\downarrow$) increase on WikiText2 when replacing all MHA layers of DeepSeek-V2-Lite with BDA. The increase is nearly imperceptible at 0.02% (FP16), with Residual-min performing better. For reference, the dashed line shows the degradation from a structured pruning baseline at the same compression ratio ($25\%$ K/V channels). (b) Efficiency: Relative speedup for the k_proj operator under FP16 and BF16. The dashed line at $1.33\times$ marks the theoretical bound. Measured speedups fluctuate around this line but consistently exceed the MHA baseline, averaging $1.32\times$ (FP16) and $1.34\times$ (BF16). BDA also reduces parameter and memory usage by 25%.
  • Figure : MHA Inference
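
The $1.33\times$ bound in Figure 2 is consistent with the $d_h / d$ saving in Figure 1. As a worked check, assume the operator's runtime is proportional to its multiply-accumulate count and that $d_h / d = 1/4$ (an assumption matching the 25% reduction reported above):

```latex
\text{speedup bound} \;=\; \frac{1}{1 - d_h / d} \;=\; \frac{1}{1 - \tfrac{1}{4}} \;=\; \frac{4}{3} \;\approx\; 1.33
```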

Theorems & Definitions (3)

  • Theorem 3.1: Almost Sure Full Rank of Random Matrices
  • Theorem A.1: Almost Sure Full Rank of Random Matrices
  • proof