Beyond Classical Attention: Quantum Attention for Scalable Computation
Xuyang Guo, Zhao Song, Xin Yang, Ruizhe Zhang
TL;DR
This work tackles the quadratic bottleneck of Transformer attention in large language models by exploiting a sparsity pattern in $QK^\top$ under a $(\tau,k)$-good model and applying Grover's search to identify the large entries. It introduces a quantum algorithm that outputs a sparse-plus-rank-one approximation $B=B_1+B_2$ to $A=\exp(QK^\top)$ satisfying $\| D(A)^{-1} A - D(B)^{-1} B \|_{\infty} = O(\eta)$. The construction time is $\tilde{O}( n( \sqrt{nk} d + kd) )$; inference runs in $\tilde{O}( n^{1.5} k^{0.5} d + nkd )$ time (the same bound, expanded) with accuracy $O(\eta^2)$. The sparse-plus-rank-one structure thus yields a polynomial speedup for attention computation during inference, while a classical half-space-reporting analogue and a fine-grained SETH-based lower bound contextualize the limits of classical approaches. The paper also outlines future directions: integrating quantum subroutines into training and exploring QRAM-aware architectures to further improve runtime and memory efficiency in Transformer workloads.
Abstract
As large language models (LLMs) demonstrate outstanding performance across various tasks, attention-driven models have profoundly transformed the field of machine learning. Since attention computations account for the primary computational overhead in both model inference and training, efficiently computing attention matrices has become one of the core challenges in accelerating large language models. Quantum machines are known to possess computational advantages over classical machines, yet the role of quantum computing in LLMs remains largely unexplored. In this work, we use Grover's search algorithm to efficiently compute a sparse attention matrix. Through comparison with classical algorithms, we show that our method achieves a polynomial quantum speedup. Additionally, we observe that the resulting quantum attention matrices naturally exhibit low-rank structure, providing further theoretical support for efficient modeling. Finally, we give a systematic and detailed analysis of the error and time complexity of the proposed algorithm in the specific context of attention matrix computation.
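To make the sparse-plus-low-rank idea concrete, the following is a minimal classical sketch in Python, not the paper's algorithm. It rests on several assumptions that are not stated in the text above: the score matrix `S` stands in for $QK^\top$; every score that is not "large" lies in $[-\eta, 0]$, so its exponential is close to 1; the rank-one part $B_2$ is the all-ones matrix and the sparse part $B_1$ stores $\exp(s)-1$ at the large entries; and $D(M)$ denotes the diagonal matrix of row sums of $M$. Grover's search is replaced by a plain linear scan, so the sketch shows only the approximation quality and gives no speedup. All function names are hypothetical.

```python
import math
import random


def row_normalize(M):
    """Return D(M)^{-1} M, where D(M) holds the row sums of M (assumed definition)."""
    out = []
    for row in M:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def exact_attention(S):
    """Exact normalized attention D(A)^{-1} A for A = exp(S), taken entrywise."""
    return row_normalize([[math.exp(s) for s in row] for row in S])


def find_large_entries(row, tau):
    """Classical stand-in for Grover search: linear scan for scores >= tau."""
    return [j for j, s in enumerate(row) if s >= tau]


def sparse_plus_rank_one_attention(S, tau):
    """Normalized D(B)^{-1} B with B = B_1 + B_2 (assumed construction).

    B_2 is the all-ones matrix (rank one). B_1 is zero except at large
    entries, where it stores exp(s) - 1 so that B equals A there.
    """
    B = []
    for row in S:
        b_row = [1.0] * len(row)                # one row of B_2
        for j in find_large_entries(row, tau):
            b_row[j] += math.exp(row[j]) - 1.0  # sparse correction from B_1
        B.append(b_row)
    return row_normalize(B)


def max_abs_diff(X, Y):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def make_scores(n, eta, big=3.0, seed=0):
    """Hypothetical instance: one large score per row, the rest in [-eta, 0]."""
    rng = random.Random(seed)
    S = [[-eta * rng.random() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        S[i][i] = big
    return S


eta = 0.01
S = make_scores(n=8, eta=eta)
err = max_abs_diff(exact_attention(S), sparse_plus_rank_one_attention(S, tau=1.0))
print(f"max entrywise error: {err:.2e} (eta = {eta})")
```

Under these assumptions, each normalized entry of $B$ differs from the corresponding entry of $A$ by a multiplicative factor in $[e^{-\eta}, e^{\eta}]$, so the entrywise error is at most $e^{\eta}-1$, which is consistent with an $O(\eta)$ guarantee. In the quantum setting, the linear scan in `find_large_entries` is the step that Grover's search would replace.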
