Quantum Vision Transformers

El Amine Cherrat; Iordanis Kerenidis; Natansh Mathur; Jonas Landman; Martin Strahm; Yun Yvonna Li

TL;DR

This work investigates quantum vision transformers by introducing three architectures (Orthogonal Patch-wise Transformer, Quantum Orthogonal Transformer, and Quantum Compound Transformer) that replace or augment the classical attention mechanism with quantum data loaders and orthogonal or compound-layer circuits. It provides concrete circuit designs (Pyramid, Butterfly, X) for quantum linear algebra, a matrix-data-loading scheme, and a second-order compound-matrix approach to enable native quantum attention. In simulations on MedMNIST and in experiments on quantum hardware with a limited number of qubits, the quantum transformers achieve competitive accuracy with far fewer trainable parameters than classical Vision Transformers, suggesting potential runtime and resource-efficiency advantages. The results remain preliminary due to hardware noise and scale limits, but indicate a promising direction for scalable quantum-enhanced vision models on near-term devices.

Abstract

In this work, quantum transformers are designed and analysed in detail by extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis. Building upon previous work, which uses parametrised quantum circuits for data loading and orthogonal neural layers, we introduce three types of quantum transformers for training and inference, including a quantum transformer based on compound matrices, which guarantees a theoretical advantage of the quantum attention mechanism compared to its classical counterpart both in terms of asymptotic run time and the number of model parameters. These quantum architectures can be built using shallow quantum circuits and produce qualitatively different classification models. The three proposed quantum attention layers vary on the spectrum between closely following the classical transformers and exhibiting more quantum characteristics. As building blocks of the quantum transformer, we propose a novel method for loading a matrix as quantum states as well as two new trainable quantum orthogonal layers adaptable to different levels of connectivity and quality of quantum computers. We performed extensive simulations of the quantum transformers on standard medical image datasets that showed competitive, and at times better, performance compared to the classical benchmarks, including the best-in-class classical vision transformers. The quantum transformers we trained on these small-scale datasets require fewer parameters compared to standard classical benchmarks. Finally, we implemented our quantum transformers on superconducting quantum computers and obtained encouraging results for experiments with up to six qubits.

Paper Structure (26 sections, 7 equations, 16 figures, 6 tables)

This paper contains 26 sections, 7 equations, 16 figures, 6 tables.

Introduction
Quantum Tools
Quantum Data Loaders for Matrices
Quantum Orthogonal Layers
Quantum Transformers
Orthogonal Patch-wise Neural Network
Quantum Orthogonal Transformer
Direct Quantum Attention
Quantum Compound Transformer
Experiments
Simulation Setting
Simulation Results
Conclusion
Vision Transformers
Quantum Tools (Extended)
...and 11 more sections

Figures (16)

Figure 1: Vision Transformer Overview
Figure 2: Patch Division Preprocessing
Figure 3: Transformer layer
Figure 4: Attention Mechanism
Figure 5: Data loader circuit for a matrix $X\in\mathbb{R}^{n\times d}$. The top register uses $n$ qubits and the vector data loader to load the norms of each row, $(\lVert\mathbf{x}_1\rVert,\cdots,\lVert\mathbf{x}_n\rVert)$, to obtain the state $\frac{1}{\lVert X\rVert}\sum_{i=1}^n \lVert\mathbf{x}_i\rVert\,|\mathbf{e}_i\rangle$. The lower register uses $d$ qubits to load each row $\mathbf{x}_i \in \mathbb{R}^d$ sequentially, by applying the vector loader and its adjoint for each row $\mathbf{x}_i$, with CNOTs controlled by the corresponding qubit $i$ of the top register. Each loader on the lower register has depth $\mathcal{O}(\log d)$.
...and 11 more figures
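
The two-register loading described in the Figure 5 caption can be checked numerically. The sketch below is illustrative only: it does not simulate the circuit or its gates, and the function name `matrix_loader_amplitudes` is hypothetical. It assumes that both registers use the unary basis states $|\mathbf{e}_i\rangle$ and that $\lVert X\rVert$ is the Frobenius norm. Under those assumptions, loading the row norms on the top register and then each normalised row on the lower register should leave amplitude $x_{ij}/\lVert X\rVert$ on $|\mathbf{e}_i\rangle|\mathbf{e}_j\rangle$.

```python
import numpy as np


def matrix_loader_amplitudes(X):
    """Amplitudes on the unary basis |e_i>|e_j> after the two loading steps.

    Hypothetical helper for illustration; it works directly in the
    (n * d)-dimensional unary subspace rather than simulating qubits.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: the top register holds the normalised row norms.
    row_norms = np.linalg.norm(X, axis=1)   # (||x_1||, ..., ||x_n||)
    frobenius = np.linalg.norm(row_norms)   # ||X||, assumed Frobenius norm
    top = row_norms / frobenius

    # Step 2: conditioned on row index i, the lower register holds x_i / ||x_i||.
    # Rows of norm zero are left as zero vectors to avoid dividing by zero.
    safe_norms = np.where(row_norms > 0, row_norms, 1.0)
    rows = X / safe_norms[:, None]

    # Joint amplitude on |e_i>|e_j> is the product of the two steps.
    return top[:, None] * rows


X = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 12.0]])
amps = matrix_loader_amplitudes(X)
print(amps)                 # each entry equals x_ij / ||X||
print(np.sum(amps ** 2))    # squared amplitudes sum to one
```

For this example the row norms are 5 and 12, so $\lVert X\rVert = 13$ and the amplitudes are the entries of $X$ divided by 13.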
