Beyond Classical Attention: Quantum Attention for Scalable Computation

Xuyang Guo; Zhao Song; Xin Yang; Ruizhe Zhang

TL;DR

This work tackles the quadratic bottleneck of Transformer attention in large language models by exploiting a sparsity pattern in $QK^\top$ under a $(\tau,k)$-good model and applying Grover's search to identify the large entries. It introduces a quantum algorithm that outputs a sparse-plus-rank-one approximation $B=B_1+B_2$ to $A=\exp(QK^{\top})$ with $\| D(A)^{-1} A - D(B)^{-1} B \|_{\infty} = O(\eta)$ and construction time $\tilde{O}( n( \sqrt{nk} d + kd) )$, with inference time $\tilde{O}( n^{1.5} k^{0.5} d + nkd )$ and accuracy $O(\eta^2)$. This yields a polynomial speedup for attention computation during inference via the sparse-plus-rank-one structure, while a classical half-space-reporting analogue and a fine-grained SETH-based lower bound contextualize the limits of classical approaches. The paper also outlines potential future directions for integrating quantum subroutines into training and exploring QRAM-aware architectures to further improve runtime and memory efficiency in Transformer workloads.
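
The construction-time bound above can be unpacked with a short calculation. The per-term reading is an interpretation, not something stated in this summary: it assumes that for each of the $n$ rows Grover's search locates up to $k$ large entries among $n$ candidates using $\tilde{O}(\sqrt{nk})$ oracle calls, that each call evaluates a $d$-dimensional inner product, and that the $k$ located entries are then computed explicitly at cost $O(kd)$.

$$
n \cdot \left( \sqrt{nk}\, d + kd \right) = n^{1.5} k^{0.5} d + nkd,
\qquad
\frac{n^{2} d}{n^{1.5} k^{0.5} d} = \sqrt{n/k}.
$$

The first identity shows that the construction and inference expressions are the same bound written two ways. The second compares the leading term with the naive $O(n^2 d)$ cost of forming $QK^\top$ entry by entry: the gain is a factor of about $\sqrt{n/k}$, which is polynomial in $n$ whenever $k$ is polynomially smaller than $n$.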

Abstract

As large language models (LLMs) demonstrate outstanding performance across various tasks, attention-driven models have profoundly transformed the field of machine learning. Since attention computations account for the primary computational overhead in both model inference and training, efficiently computing attention matrices has become one of the core challenges in accelerating large language models. Quantum machines are known to offer computational advantages over classical machines, yet the role of quantum computing in LLMs remains largely unexplored. In this work, we use Grover's search algorithm to efficiently compute a sparse attention matrix. Through comparisons with classical algorithms, we show that our method achieves a polynomial quantum speedup. Additionally, we observe that the generated quantum attention matrices naturally exhibit low-rank structure, providing further theoretical support for efficient modeling. Finally, within the specific context of attention matrix computation, we give a systematic and detailed analysis of the error and time complexity of the proposed algorithm.
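
To make the role of Grover's search concrete, the following is a minimal classical state-vector simulation of the textbook algorithm. It is not the paper's subroutine. The function name `grover_success_probability` and its parameters are hypothetical, and the iteration count $\lfloor (\pi/4)\sqrt{N/M} \rfloor$ is the standard choice for $M$ marked items among $N$.

```python
import math
import numpy as np


def grover_success_probability(n_items, marked, iterations=None):
    """Simulate Grover's search classically and return (success_prob, iterations).

    n_items    -- size N of the search space (hypothetical parameter name)
    marked     -- indices of the marked items (the "large entries" in the analogy)
    iterations -- number of Grover iterations; defaults to floor(pi/4 * sqrt(N/M))
    """
    marked = np.array(sorted(set(marked)), dtype=int)
    if marked.size == 0:
        raise ValueError("at least one marked item is required")
    if iterations is None:
        iterations = int(math.floor(math.pi / 4 * math.sqrt(n_items / marked.size)))

    # Start in the uniform superposition over all N items.
    state = np.full(n_items, 1.0 / math.sqrt(n_items))
    for _ in range(iterations):
        # Oracle: flip the sign of the amplitudes of marked items.
        state[marked] *= -1.0
        # Diffusion: reflect every amplitude about the mean amplitude.
        state = 2.0 * state.mean() - state

    # Probability that a measurement returns some marked item.
    return float(np.sum(state[marked] ** 2)), iterations


if __name__ == "__main__":
    prob, iters = grover_success_probability(1024, [7])
    print(f"N=1024, M=1: {iters} iterations, success probability {prob:.4f}")
```

The simulation itself costs $O(N)$ per iteration, so it shows only the query count, not a speedup: about $\sqrt{N/M}$ oracle calls suffice where a classical scan needs on the order of $N/M$.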

Paper Structure (39 sections, 15 theorems, 26 equations, 3 algorithms)

Introduction
Our Results
Related Work
Attention Computation.
Classical fast neural network training algorithms
Quantum algorithms for training neural networks
Quantum optimization algorithms
Quantum machine learning
Roadmap
Preliminary
Notations
Grover's Search
Sparsity and Perturbation Error
Sparsity Definitions
Perturbation Tools
...and 24 more sections

Key Result

Theorem 1.3

Let $A \in \mathbb{R}^{n\times n}$, $Q \in \mathbb{R}^{n \times d}$, $K\in \mathbb{R}^{n\times d}$ and $D\in \mathbb{R}^{n\times n}$ be defined as in the definition of the attention matrix. If the conditions of the theorem hold (the TL;DR names the $(\tau,k)$-good assumption on $QK^\top$), then there exists a quantum algorithm (implicitly) outputting a matrix $B\in \mathbb{R}^{n\times n}$ such that, as the TL;DR summarizes, $\| D(A)^{-1} A - D(B)^{-1} B \|_{\infty} = O(\eta)$.
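
The error measure in this result can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: $D(M)$ is read as $\mathrm{diag}(M \mathbf{1}_n)$, the norm as the largest absolute entry, the rank-one part $B_2$ as the all-ones matrix (approximating $\exp$ of entries near zero), and the sparse part $B_1$ as the correction $\exp(s) - 1$ on entries $s \ge \tau$. A brute-force scan stands in for the quantum search, and the function names are hypothetical.

```python
import numpy as np


def row_normalize(M):
    """Return D(M)^{-1} M, assuming D(M) = diag(M 1_n) (row sums on the diagonal)."""
    return M / M.sum(axis=1, keepdims=True)


def sparse_plus_rank_one(Q, K, tau):
    """Build an illustrative B1 (sparse) and B2 (rank one) for A = exp(Q K^T).

    Assumption: B2 is the all-ones matrix and B1 holds exp(s) - 1 on the
    entries s >= tau. A classical scan replaces Grover's search here.
    """
    S = Q @ K.T
    large = S >= tau                            # positions of the large entries
    B1 = np.where(large, np.exp(S) - 1.0, 0.0)  # sparse correction term
    B2 = np.ones_like(S)                        # rank-one term 1 1^T
    return B1, B2, large


def approximation_error(Q, K, tau):
    """Return (max-entry error of normalized A vs. B, largest per-row count k)."""
    A = np.exp(Q @ K.T)
    B1, B2, large = sparse_plus_rank_one(Q, K, tau)
    err = np.abs(row_normalize(A) - row_normalize(B1 + B2)).max()
    return float(err), int(large.sum(axis=1).max())


if __name__ == "__main__":
    n = 6
    # One large score per row (5.0) and small scores elsewhere (0.01).
    Q = 5.0 * np.eye(n) + 0.01 * (np.ones((n, n)) - np.eye(n))
    K = np.eye(n)
    err, k = approximation_error(Q, K, tau=1.0)
    print(f"k = {k}, error = {err:.2e}")
```

In this toy instance each row has one large score, so $B_1$ has $n$ nonzeros and the approximation error comes only from treating the small scores as exactly zero.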

Theorems & Definitions (33)

Definition 1.1
Definition 1.2
Theorem 1.3: Quantum algorithm for attention matrix approximation
Theorem 1.4: Informal version of the formal main result theorem (thm:main_result:formal)
Theorem 1.5: Classical algorithm for attention matrix approximation
Theorem 3.1: Grover's search algorithm (Grover, 1996)
Definition 4.1: Find Set
Definition 4.2
Lemma 4.3
Lemma 4.4
...and 23 more
