Beyond Classical Attention: Quantum Attention for Scalable Computation

Xuyang Guo, Zhao Song, Xin Yang, Ruizhe Zhang

TL;DR

This work tackles the quadratic bottleneck of Transformer attention in large language models by exploiting a sparsity pattern in $QK^\top$ under a $(\tau,k)$-good model and applying Grover's search to identify the large entries. It introduces a quantum algorithm that outputs a sparse-plus-rank-one approximation $B=B_1+B_2$ to $A=\exp(QK^{\top})$ with $\| D(A)^{-1} A - D(B)^{-1} B \|_{\infty} = O(\eta)$ and construction time $\tilde{O}( n( \sqrt{nk} d + kd) )$, with inference time $\tilde{O}( n^{1.5} k^{0.5} d + nkd )$ and accuracy $O(\eta^2)$. This yields a polynomial speedup for attention computation during inference via the sparse-plus-rank-one structure, while a classical half-space-reporting analogue and a fine-grained SETH-based lower bound contextualize the limits of classical approaches. The paper also outlines potential future directions for integrating quantum subroutines into training and exploring QRAM-aware architectures to further improve runtime and memory efficiency in Transformer workloads.
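The sparse-plus-rank-one structure $B=B_1+B_2$ can be illustrated with a small classical sketch. This is not the paper's algorithm. It assumes, for illustration only, that scores below $\tau$ are close to $0$, so those entries of $\exp(QK^\top)$ are replaced by $\exp(0)=1$ (the all-ones rank-one part), and it reads $D(\cdot)$ as the diagonal matrix of row sums and $\|\cdot\|_\infty$ as the largest absolute entry. The function names `sparse_plus_rank_one`, `apply_normalized`, and `normalized_error` are hypothetical, and the linear scan over scores stands in for the Grover search step.

```python
import numpy as np

def sparse_plus_rank_one(Q, K, tau):
    """Return (B1, u, mask) with B = B1 + u u^T approximating A = exp(Q K^T).

    Assumption (illustrative, not taken from the paper): scores below tau are
    close to 0, so exp(score) is replaced by exp(0) = 1 on those entries.
    """
    n = Q.shape[0]
    scores = Q @ K.T          # classical O(n^2 d) scan; stands in for Grover search
    mask = scores >= tau      # "large" entries, at most k per row if (tau, k)-good
    # Correction on the large entries only (kept dense here for simplicity;
    # a real implementation would use sparse storage).
    B1 = np.where(mask, np.exp(scores) - 1.0, 0.0)
    u = np.ones(n)            # rank-one part B2 = u u^T (all ones)
    return B1, u, mask

def apply_normalized(B1, u, V):
    """Compute D(B)^{-1} B V without forming the dense rank-one part."""
    row_sums = B1.sum(axis=1) + u * u.sum()   # row sums of B = B1 + u u^T
    BV = B1 @ V + np.outer(u, u @ V)          # B V = B1 V + u (u^T V)
    return BV / row_sums[:, None]

def normalized_error(Q, K, B1, u):
    """Largest entrywise gap between D(A)^{-1} A and D(B)^{-1} B."""
    A = np.exp(Q @ K.T)
    B = B1 + np.outer(u, u)
    PA = A / A.sum(axis=1, keepdims=True)
    PB = B / B.sum(axis=1, keepdims=True)
    return np.abs(PA - PB).max()

# Toy instance: scores are about 4 on the diagonal and about 0 elsewhere.
rng = np.random.default_rng(0)
n = d = 8
Q = 2.0 * np.eye(n) + 1e-3 * rng.standard_normal((n, d))
K = 2.0 * np.eye(n) + 1e-3 * rng.standard_normal((n, d))
B1, u, mask = sparse_plus_rank_one(Q, K, tau=1.0)
print(mask.sum(axis=1).max())         # number of large entries in the densest row
print(normalized_error(Q, K, B1, u))  # normalized approximation error
```

Because $B_2$ is rank one, `apply_normalized` never materializes an $n \times n$ dense matrix for it: the product with $V$ costs one pass over the nonzeros of $B_1$ plus $O(nd)$ work for the rank-one term.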

Abstract

As large language models (LLMs) demonstrate outstanding performance across various tasks, attention-driven models have profoundly transformed the field of machine learning. Since attention computations account for the primary computational overhead in both model inference and training, efficiently computing attention matrices has become one of the core challenges in accelerating large language models. It is well-known that quantum machines possess computational advantages over classical machines, and the role of quantum computing in LLMs remains largely unexplored. In this work, we focus on leveraging the Grover search algorithm to efficiently compute a sparse attention matrix. Through comparisons with classical algorithms, we demonstrate that our method achieves quantum acceleration in polynomial time. Additionally, we observe that the generated quantum attention matrices naturally exhibit low-rank structures, providing further theoretical support for efficient modeling. Moreover, within the specific context of attention matrix computation, we conduct a systematic and detailed analysis of the error and time complexity of the proposed algorithm.
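The size of the polynomial speedup can be read off the bounds stated in the TL;DR. The comparison below is a worked sketch that assumes the classical baseline is the standard $O(n^2 d)$ cost of forming $QK^\top$ entry by entry (a baseline not stated in this summary) and ignores logarithmic factors and constants. The $\sqrt{nk}$ factor is consistent with the known bound that Grover-type search finds all $k$ marked items among $n$ candidates in $O(\sqrt{nk})$ oracle queries.

$$
n\left(\sqrt{nk}\,d + kd\right) = n^{1.5}k^{0.5}d + nkd,
\qquad
\frac{n^{1.5}k^{0.5}d + nkd}{n^{2}d} = \sqrt{\frac{k}{n}} + \frac{k}{n}.
$$

The first identity shows that the construction-time and inference-time bounds coincide. The ratio tends to $0$ whenever $k = o(n)$. For example, with $n = 10^4$ and $k = 100$ it equals $0.1 + 0.01 = 0.11$, while for $k = \Theta(n)$ the bound offers no asymptotic gain over the quadratic baseline.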

Paper Structure

This paper contains 39 sections, 15 theorems, 26 equations, 3 algorithms.

Key Result

Theorem 1.3

Let $A \in \mathbb{R}^{n\times n}$, $Q \in \mathbb{R}^{n \times d}$, $K\in \mathbb{R}^{n\times d}$, and $D\in \mathbb{R}^{n\times n}$ be defined as in the attention matrix definition (def:attention_matrix). If the following conditions hold ... then there exists a quantum algorithm (implicitly) outputting a matrix $B\in \mathbb{R}^{n\times n}$ such that ...

Theorems & Definitions (33)

  • Definition 1.1
  • Definition 1.2
  • Theorem 1.3: Quantum algorithm for attention matrix approximation
  • Theorem 1.4: Informal version of the formal main result (thm:main_result:formal)
  • Theorem 1.5: Classical algorithm for attention matrix approximation
  • Theorem 3.1: Grover's search algorithm [g96]
  • Definition 4.1: Find Set
  • Definition 4.2
  • Lemma 4.3
  • Lemma 4.4
  • ...and 23 more