Quantum Transformer: Accelerating model inference via quantum linear algebra

Naixu Guo; Zhan Yu; Matthew Choi; Yizhan Han; Aman Agrawal; Kouhei Nakaji; Alán Aspuru-Guzik; Patrick Rebentrost

TL;DR

The paper proposes a fault-tolerant quantum transformer by embedding transformer blocks into block-encoded matrices and applying quantum linear algebra via the quantum singular value transformation (QSVT), including a novel element-wise block-encoding technique built from Hadamard products. It formalizes quantum self-attention, residual connections with layer normalization, and a GELU-based feed-forward network (FFN) as modular quantum subroutines and analyzes their runtimes in terms of the encoding factors $\boldsymbol{\bigl(\alpha_s,\alpha_w,\beta_m\bigr)}$, showing a per-token state preparation cost of $\tilde{O}(\vert QK^\top \vert)$, a single-layer cost of $\tilde{O}(\sqrt{N}\, d)$, and an overall multilayer cost of $\tilde{O}(k N^{3/2} d)$. The authors support their theoretical claims with numerical analyses of input and weight norms on real models and genomic datasets, arguing that the predicted speedups are plausible in practical regimes, particularly under QRAM or favorable data-normalization assumptions. They discuss dequantization risks, propose a quantum-friendly transformer paradigm, and outline extensions to training and more general architectures, highlighting the potential for polynomial (and, in special regimes, exponential) speedups for transformer inference on fault-tolerant quantum hardware.
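For orientation, the classical computation that these quantum subroutines target can be sketched as follows. This is a minimal single-head, unmasked reference in plain Python, not the paper's algorithm. The function names, the weight shapes (`Wq`, `Wk`, `Wv` of size d x d, `W1` of size d x m, `W2` of size m x d), the 1/sqrt(d) score scaling, and the post-norm ordering are assumptions made for illustration.

```python
import math

def vec_mat(x, W):
    """Row vector x (length d) times matrix W (d x m)."""
    return [sum(x[i] * W[i][c] for i in range(len(x))) for c in range(len(W[0]))]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def gelu(v):
    # Exact GELU: v * Phi(v), with Phi the standard normal CDF.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def transformer_token(S, j, Wq, Wk, Wv, W1, W2):
    """Output vector of one transformer block for token j (illustrative)."""
    d = len(S[0])
    q = vec_mat(S[j], Wq)
    keys = [vec_mat(s, Wk) for s in S]
    vals = [vec_mat(s, Wv) for s in S]
    # Self-attention: softmax over the scaled scores of token j against all tokens.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    attn = [sum(w[i] * vals[i][c] for i in range(len(S))) for c in range(d)]
    # Residual connection with layer normalization.
    h = layer_norm([a + b for a, b in zip(attn, S[j])])
    # Feed-forward network with GELU, then a second residual + layer norm.
    ff = vec_mat([gelu(v) for v in vec_mat(h, W1)], W2)
    return layer_norm([a + b for a, b in zip(ff, h)])
```

The quantum version described here prepares an amplitude encoding of a vector like the one returned by `transformer_token`, rather than computing its entries one by one.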

Abstract

Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, under the lens of fault-tolerant quantum computing. We develop quantum subroutines to construct the building blocks in the transformer, including the self-attention, residual connection with layer normalization, and feed-forward network. As an important subroutine, we show how to efficiently implement the Hadamard product and element-wise functions of matrices on quantum computers. Our algorithm prepares an amplitude encoding of the transformer output, which can be measured for prediction or use in the next layer. We find that the matrix norm of the input sequence plays a dominant role in the quantum complexity. With numerical experiments on open-source LLMs, including for bio-informatics applications, we demonstrate the potential of a quantum speedup for transformer inference in practical regimes.
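A standard linear-algebra identity helps explain why a Hadamard product is within reach of matrix-based primitives: the element-wise product $A \circ B$ appears as a submatrix of the Kronecker product $A \otimes B$. The sketch below checks this classically in plain Python. Whether the paper's block-encoding construction follows exactly this route is not stated in this summary, and the helper names are hypothetical.

```python
def kron(A, B):
    """Kronecker product of A (p x q) and B (n x m) as nested lists."""
    n, m = len(B), len(B[0])
    return [[A[i // n][j // m] * B[i % n][j % m]
             for j in range(len(A[0]) * m)]
            for i in range(len(A) * n)]

def hadamard_from_kron(A, B):
    """Read A∘B off A⊗B; A and B must share the same shape n x m."""
    n, m = len(A), len(A[0])
    K = kron(A, B)
    # Row i*n + i and column j*m + j of A⊗B hold the entry A[i][j] * B[i][j].
    return [[K[i * n + i][j * m + j] for j in range(m)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(hadamard_from_kron(A, B))  # [[5, 12], [21, 32]]
```

Element-wise functions can then, in principle, be approximated by polynomials built from repeated Hadamard products of a matrix with itself.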

Paper Structure (41 sections, 36 theorems, 94 equations, 7 figures, 6 tables)

This paper contains 41 sections, 36 theorems, 94 equations, 7 figures, 6 tables.

Introduction
Results
Quantum linear algebra
Quantum transformer architecture
Numerical analysis
Runtime and speedup
Discussion
Methods
Proof sketch of Theorem 2
Quantum self attention
Quantum feed-forward network with GELU function
Numerical details
Quantum multilayer transformer
Robustness to dequantization
Preliminary
...and 26 more sections

Key Result

Theorem 1

For a transformer with embedding dimension $d$ and an input sequence $S$ of length $N$, given access to the sequence matrix and weight matrices via block encodings, for any index $j\in [N]$, one can construct a quantum circuit that prepares the output state for the $j$-th token up to error $\epsilon$ using $\widetilde{\mathcal{O}}(\sqrt{N} d \log^2(1/\epsilon))$ applications of the input block encodings.
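To get a feel for how the bound in Theorem 1 scales, the following sketch evaluates the leading expression $\sqrt{N}\, d \log^2(1/\epsilon)$ in plain Python. It drops all constants and the polylogarithmic factors hidden by the tilde, so the values are only meaningful as ratios. The function name is hypothetical.

```python
import math

def quantum_queries(N, d, eps):
    """Leading term sqrt(N) * d * log^2(1/eps); constants and hidden polylogs dropped."""
    return math.sqrt(N) * d * math.log(1.0 / eps) ** 2

# Quadrupling the sequence length doubles the leading term.
print(quantum_queries(4096, 768, 1e-3) / quantum_queries(1024, 768, 1e-3))

# Squaring the target precision (1e-3 -> 1e-6) multiplies it by about four.
print(quantum_queries(1024, 768, 1e-6) / quantum_queries(1024, 768, 1e-3))
```

The dependence on the encoding factors of the input block encodings, which the numerical analysis addresses, is not captured by this expression.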

Figures (7)

Figure S1: Overview of the quantum transformer architecture. As in the original decoder-only transformer architecture, the quantum transformer consists of a self-attention sub-layer and a feed-forward network sub-layer, incorporating a residual connection with layer normalization. The inputs of the quantum transformer are block encodings of the input sequence and the pre-trained weight matrices, from which the relevant matrices for the transformer are constructed (query $Q$, key $K$, and value $V$). Given the input block encodings, we construct the corresponding quantum subroutines and combine them to obtain the classical output vector corresponding to the $j$-th token. A multilayer architecture can be achieved by iterating the procedure for each token $j\in [N]$ and producing a new block encoding of the input sequence for the next layer.
Figure S2: Scaling of the spectral norm $\|S\|$ and the Frobenius norm $\|S\|_{F}$ with $N$ for each model, displayed on logarithmic scales for both axes. For reference, the line $y \propto \sqrt{x}$ is also shown. We use tokens in the MMLU dataset and convert them to $S$. The embedding dimension $d$ is $768$ for BERT [devlin2019bert], RoBERTa [liu2019roberta], GPT [radford2018improve], DistilGPT [sanh2019distilbert], and GPT2 [radford2019language]; $2048$ for TinyLlama [zhang2024tinyllamaopensourcesmalllanguage]; and $4096$ for both Llama2-7B [touvron2023llama2openfoundation] and Mistral-7B [jiang2023mistral7b].
Figure S3: Overview of the quantum transformer architecture.
Figure S4: Scaling of the spectral norm $\|S\|$ and the Frobenius norm $\|S\|_{F}$ with $N$ for each model, displayed on logarithmic scales for both axes. For reference, the line $y \propto \sqrt{x}$ is also shown. We randomly generate tokens and convert them to $S$.
Figure S5: Scaling of the spectral norm $\|S\|$ and the Frobenius norm $\|S\|_{F}$ with $N$ for each model, displayed on logarithmic scales for both axes. For reference, the line $y \propto \sqrt{x}$ is also shown. We use tokens in the MMLU dataset and convert them to $S$.
...and 2 more figures
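The norm-scaling figures compare $\|S\|$ and $\|S\|_{F}$ against a $\sqrt{N}$ reference line. The sketch below shows how such quantities can be computed for an $N \times d$ matrix in plain Python, using power iteration for the spectral norm. It uses synthetic random rows normalized to unit length, not the model embeddings from the paper, so it only illustrates the measurement. All names are hypothetical.

```python
import math
import random

def frobenius_norm(S):
    return math.sqrt(sum(v * v for row in S for v in row))

def spectral_norm(S, iters=200, seed=0):
    """Largest singular value of S via power iteration on S^T S."""
    rng = random.Random(seed)
    d = len(S[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        y = [sum(r[c] * x[c] for c in range(d)) for r in S]                # y = S x
        x = [sum(S[i][c] * y[i] for i in range(len(S))) for c in range(d)]  # x = S^T y
        nrm = math.sqrt(sum(v * v for v in x))
        x = [v / nrm for v in x]
    y = [sum(r[c] * x[c] for c in range(d)) for r in S]
    return math.sqrt(sum(v * v for v in y))

def unit_rows(N, d, seed=0):
    """Synthetic N x d matrix whose rows are random unit vectors (assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        r = [rng.gauss(0.0, 1.0) for _ in range(d)]
        nrm = math.sqrt(sum(v * v for v in r))
        rows.append([v / nrm for v in r])
    return rows

for N in (16, 64, 256):
    S = unit_rows(N, 8)
    print(N, round(frobenius_norm(S), 3), round(spectral_norm(S), 3), round(math.sqrt(N), 3))
```

With unit-norm rows the Frobenius norm equals $\sqrt{N}$ exactly, and the spectral norm never exceeds it; how closely real embeddings follow the $\sqrt{N}$ line is what the figures examine.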

Theorems & Definitions (60)

Theorem 1: Quantum transformer, informal
Theorem 2: Element-wise function of block encodings, informal
Definition 1: Block encoding [chakraborty2019power, gilyen2019quantum]
Definition 2: State preparation encoding
Theorem 3: Quantum state preparation [sun2023asymptotically]
Definition 3: State preparation pair [chakraborty2019power, gilyen2019quantum]
Lemma 1: Linear combination of block-encoded matrices [chakraborty2019power, gilyen2019quantum]
Lemma 2: Product of block-encoded matrices [chakraborty2019power, gilyen2019quantum]
Theorem 4: Polynomial eigenvalue transformation [gilyen2019quantum]
Definition 4: Input assumption
...and 50 more
