Quantum Transformer: Accelerating model inference via quantum linear algebra
Naixu Guo, Zhan Yu, Matthew Choi, Yizhan Han, Aman Agrawal, Kouhei Nakaji, Alán Aspuru-Guzik, Patrick Rebentrost
TL;DR
The paper proposes a fault-tolerant quantum transformer by embedding transformer blocks into block-encoded matrices and applying quantum linear algebra via QSVT, including a novel element-wise block-encoding technique built from Hadamard products. It formalizes the quantum self-attention, residual connections with layer normalization, and GELU-based FFN as modular quantum subroutines and analyzes their runtimes under encoding factors $\boldsymbol{\bigl(\alpha_s,\alpha_w,\beta_m\bigr)}$, showing a per-token state preparation cost of $\tilde{O}(\vert QK^\top\vert)$ and an overall multilayer cost of $\tilde{O}(k N^{3/2} d)$, with single-layer costs of $\tilde{O}(\sqrt{N}\, d)$. The authors support their theoretical claims with numerical analyses of input and weight norms on real models and genomic datasets, arguing that the predicted speedups are plausible in practical regimes, particularly under QRAM or favorable data-normalization assumptions. They discuss dequantization risks, propose a quantum-friendly transformer paradigm, and outline extensions to training and more general architectures, highlighting potential for polynomial (and, in special regimes, exponential) speedups for transformer inference on fault-tolerant quantum hardware.
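To make the notion of a block-encoded matrix and its encoding factor concrete, the following is a minimal classical sketch in NumPy, not the paper's construction. It builds the textbook unitary dilation of a square matrix $A$: a unitary $U$ of twice the dimension whose top-left block is $A/\alpha$, which requires $\alpha \ge \Vert A \Vert$ (spectral norm). The function name `block_encode` and the choice of dilation are assumptions made for illustration only.

```python
import numpy as np

def block_encode(A, alpha=None):
    """Return (U, alpha) where U is unitary and U[:n, :n] == A / alpha.

    Illustrative dense construction (unitary dilation); `alpha` plays the
    role of an encoding factor and must be at least the spectral norm of A.
    """
    A = np.asarray(A, dtype=complex)
    n, m = A.shape
    if n != m:
        raise ValueError("this sketch only handles square matrices")
    spectral_norm = np.linalg.norm(A, 2)
    if alpha is None:
        alpha = spectral_norm
    if alpha <= 0 or alpha < spectral_norm - 1e-12:
        raise ValueError("alpha must be positive and >= the spectral norm of A")

    # SVD of the rescaled matrix: A / alpha = W diag(s) Vh, with s in [0, 1].
    W, s, Vh = np.linalg.svd(A / alpha)
    s = np.clip(s, 0.0, 1.0)          # guard against rounding slightly above 1
    c = np.sqrt(1.0 - s**2)           # complementary singular values
    V = Vh.conj().T

    top = np.hstack([(W * s) @ Vh, (W * c) @ W.conj().T])
    bottom = np.hstack([(V * c) @ Vh, -(V * s) @ W.conj().T])
    return np.vstack([top, bottom]), alpha
```

A short usage example: `U, alpha = block_encode(np.random.default_rng(0).normal(size=(4, 4)))` yields an 8×8 unitary, and `alpha * U[:4, :4]` recovers the input. A larger `alpha` shrinks the encoded block, which is one way to see why the norms of the input and weight matrices enter the runtime analysis.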
Abstract
Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, under the lens of fault-tolerant quantum computing. We develop quantum subroutines to construct the building blocks in the transformer, including the self-attention, residual connection with layer normalization, and feed-forward network. As an important subroutine, we show how to efficiently implement the Hadamard product and element-wise functions of matrices on quantum computers. Our algorithm prepares an amplitude encoding of the transformer output, which can be measured for prediction or use in the next layer. We find that the matrix norm of the input sequence plays a dominant role in the quantum complexity. With numerical experiments on open-source LLMs, including for bio-informatics applications, we demonstrate the potential of a quantum speedup for transformer inference in practical regimes.
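As a classical illustration of why the Hadamard product fits naturally with tensor-product structure, the sketch below uses a standard linear-algebra identity: $A \circ B$ is a submatrix of the Kronecker product $A \otimes B$. It then evaluates an element-wise polynomial as a sum of Hadamard powers. This is a NumPy sketch under stated assumptions, not the quantum subroutine itself; the helper names `hadamard_via_kron` and `elementwise_poly` are hypothetical, and whether the paper's circuit follows this exact route is not claimed here.

```python
import numpy as np

def hadamard_via_kron(A, B):
    """Extract the Hadamard product A * B as a submatrix of kron(A, B)."""
    A = np.asarray(A)
    B = np.asarray(B)
    if A.shape != B.shape or A.ndim != 2:
        raise ValueError("A and B must be 2-D arrays of the same shape")
    n, m = A.shape
    # kron(A, B)[i*n + k, j*m + l] == A[i, j] * B[k, l]; pick k = i and l = j.
    rows = np.arange(n) * (n + 1)
    cols = np.arange(m) * (m + 1)
    return np.kron(A, B)[np.ix_(rows, cols)]

def elementwise_poly(A, coeffs):
    """Apply p(x) = sum_k coeffs[k] * x**k to every entry of A.

    Each power is built as a repeated Hadamard product (illustration only;
    the Kronecker product makes this far slower than plain A**k).
    """
    A = np.asarray(A, dtype=float)
    result = np.zeros(A.shape)
    power = np.ones(A.shape)          # zeroth Hadamard power is the all-ones matrix
    for c in coeffs:
        result += c * power
        power = hadamard_via_kron(power, A)
    return result
```

For example, `elementwise_poly(A, [1.0, 2.0, 3.0])` agrees with `1 + 2 * A + 3 * A**2` computed entry by entry. Non-polynomial element-wise functions such as GELU would need a polynomial approximation before they fit this pattern.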
