Quantum Attention by Overlap Interference: Predicting Sequences from Classical and Many-Body Quantum Data

Alessio Pecilli, Matteo Rosati

TL;DR

We propose a variational quantum implementation of self-attention (QSA), the core operation in transformers and large language models, which predicts future elements of a sequence by forming overlap-weighted combinations of past data. Its dominant gate complexity scales as O(T d^2), versus O(T^2 d) classically, suggesting an advantage in the practical regime where the sequence length T dominates the embedding size d.

Abstract

We propose a variational quantum implementation of self-attention (QSA), the core operation in transformers and large language models, which predicts future elements of a sequence by forming overlap-weighted combinations of past data. Unlike previous approaches, our QSA realizes the required nonlinearity through interference of state overlaps and returns a Rényi-1/2 cross-entropy loss directly as the expectation value of an observable, avoiding the need to decode amplitude-encoded predictions into classical logits. Furthermore, QSA naturally accommodates a constrained, trainable data embedding that ties quantum state overlaps to data-level similarities. We find a dominant gate-complexity scaling of O(T d^2) for QSA, versus O(T^2 d) classically, suggesting an advantage in the practical regime where the sequence length T dominates the embedding size d. In simulations, we show that our QSA-based quantum transformer learns sequence prediction on classical data and on many-body transverse-field Ising quantum trajectories, establishing trainable attention as a practical primitive for quantum dynamical modeling.
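For orientation, the classical operation that the abstract compares against can be sketched as follows. This is a minimal NumPy sketch of standard causal softmax self-attention, not the paper's QSA circuit; the function names, the `Wq`, `Wk`, `Wv` projection matrices, and the random test data are illustrative assumptions. It shows where the two scalings quoted in the abstract come from: the linear projections cost on the order of T d^2 operations, while the T x T overlap matrix and the weighted sum cost on the order of T^2 d.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Classical causal self-attention (illustrative sketch, not the paper's QSA).

    X is a (T, d) array of embedded sequence elements. Returns the (T, d)
    overlap-weighted combinations of past data and the (T, T) attention weights.
    """
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # projections: ~T d^2 each
    scores = (Q @ K.T) / np.sqrt(d)              # all pairwise overlaps: ~T^2 d
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # position t sees only s <= t
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights                  # weighted sum: ~T^2 d

def dominant_costs(T, d):
    """Dominant scalings quoted in the abstract, with constants dropped."""
    return {"qsa": T * d**2, "classical": T**2 * d}

# Hypothetical toy data, chosen only to exercise the function.
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Z, A = causal_self_attention(X, Wq, Wk, Wv)
costs = dominant_costs(T, d)
```

With constants dropped, the ratio of the two dominant terms is T/d, which is why the suggested advantage applies when the sequence length exceeds the embedding size.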

Paper Structure (9 equations, 2 figures)

This paper contains 9 equations and 2 figures.

Figures (2)

  • Figure 1: Schematic depiction of our QSA circuit, as described in the text. $V$ and $W$ are $n$-qubit $L$-layer variational gates on each of the data registers $AB$, while $R$ is a product of single-qubit variational rotations, one on each qubit of the ancillary register $C$. The controlled gates perform amplitude encodings of the data. The measurements estimate the expectation of $(Z+\mathbbm{1})/2$ on each qubit, corresponding to the Rényi-$\frac{1}{2}$ cross-entropy loss. After training, removing the measurement of $A$ and the last Hadamard gates, and post-selecting on the string $\ket{j}_C$, yields an amplitude encoding of the predicted token $\ket{\tilde{{\mathbf z}}_j}_A$.
  • Figure 2: Plot of the values of the loss function ${\mathcal{L}}_{\frac{1}{2}}(p)$, up to a constant $\log T$ term, vs. training epochs, for the QSA, S-CSA and L-CSA described in the text, applied to two generative modelling tasks: (a) prediction of classical data sequences; (b) prediction of quantum state evolution under a transverse-field Ising Hamiltonian.
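
As a small worked identity behind the measurement described in the Figure 1 caption: the single-qubit observable $(Z+\mathbbm{1})/2$ is the projector onto $\ket{0}$, so its expectation value is the probability of obtaining outcome $0$ on that qubit. The state symbol $\rho$ below is an assumption introduced for illustration. How the per-qubit estimates combine into ${\mathcal{L}}_{\frac{1}{2}}(p)$ is defined in the paper's text and is not reproduced here.

$$
\frac{Z+\mathbbm{1}}{2} = \ket{0}\bra{0}, \qquad \operatorname{Tr}\!\left[\rho\,\frac{Z+\mathbbm{1}}{2}\right] = \bra{0}\rho\ket{0} = \Pr(\text{outcome } 0).
$$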