Quantum Doubly Stochastic Transformers

Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk

TL;DR

This work addresses instability and entropy issues in Transformer attention by introducing a parametric quantum circuit that outputs doubly stochastic attention matrices, formulating the Quantum Doubly Stochastic Transformer (QDSFormer). By replacing Softmax with DSM-producing operators (notably the variational circuit QontOT) and comparing against Sinkhorn-based and QR-inspired methods, the authors demonstrate improved expressivity, information preservation, and training stability on small-scale vision tasks. Key findings include higher DSM diversity with QontOT, competitive accuracy gains over ViT and Sinkformer across MNIST-family datasets and MedMNIST, faster Eureka moments in compositional reasoning, and resilience to training challenges, albeit with scaling and hardware-noise considerations. The work highlights a promising hybrid quantum-classical pathway for inductive biases in transformers and motivates further hardware-oriented development and theoretical analysis of DSMs in neural networks.

Abstract

At the core of the Transformer, the softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn's algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn's algorithm is iterative, approximate, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard ViT and other doubly stochastic Transformers. Beyond the Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. Our QDSFormer also shows improved training stability and lower performance variation, suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.
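To make the contrast between right stochastic and doubly stochastic attention concrete, here is a minimal standard-library sketch of the two classical routes the abstract mentions. It is an illustration under stated assumptions, not the authors' implementation: `sinkhorn` uses plain alternating row/column normalization of the exponentiated scores, and `qr_dsm` assumes that the QR-based variant squares the entries of the orthogonal factor (which is doubly stochastic because the rows and columns of an orthogonal matrix have unit norm). All function names and the toy `scores` matrix are hypothetical.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax: each row sums to 1 (right stochastic)."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def sinkhorn(scores, n_iters=100):
    """Alternately normalize rows and columns of exp(scores).

    The result is only approximately doubly stochastic after finitely
    many iterations, which is the limitation the abstract points out.
    """
    n = len(scores)
    m = max(max(row) for row in scores)
    mat = [[math.exp(v - m) for v in row] for row in scores]
    for _ in range(n_iters):
        mat = [[v / sum(row) for v in row] for row in mat]
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat

def qr_dsm(scores):
    """Assumed QR-style construction: orthonormalize the columns
    (modified Gram-Schmidt, i.e. the Q factor of a QR decomposition of a
    full-rank square matrix) and square every entry of Q."""
    n = len(scores)
    basis = []
    for j in range(n):
        w = [scores[i][j] for i in range(n)]
        for q in basis:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    # basis[j] holds column j of Q; transpose back and square elementwise.
    return [[basis[j][i] ** 2 for j in range(n)] for i in range(n)]

def marginal_errors(mat):
    """Largest deviation of any row sum and any column sum from 1."""
    n = len(mat)
    row_err = max(abs(sum(row) - 1.0) for row in mat)
    col_err = max(abs(sum(mat[i][j] for i in range(n)) - 1.0) for j in range(n))
    return row_err, col_err

# Toy stand-in for a query-key score matrix (hypothetical values).
scores = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3], [-0.5, 0.2, 1.0]]

for name, op in [("softmax", softmax_rows), ("sinkhorn", sinkhorn), ("qr", qr_dsm)]:
    print(name, marginal_errors(op(scores)))
```

In this sketch the softmax output has unit row sums but column sums that drift from 1, while both alternatives bring the column sums to 1 as well. Neither classical operator has trainable parameters of its own, which is the gap the variational circuit is meant to fill.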

Paper Structure

This paper contains 32 sections, 1 theorem, 18 equations, 14 figures, 7 tables.

Key Result

Theorem D.1

For any $s,k\in \mathbb{N}$, the number of $k$-tuples $(x_1, ..., x_k)$ with $x_i\in \mathbb{N}_0$ and sum $s$ is equal to the number of multisets of cardinality $s$ taken from a set of size $k$.
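The closed form is not shown in this excerpt. By the standard stars-and-bars argument, the number of multisets of cardinality $s$ from a set of size $k$ is $\binom{s+k-1}{s}$. The following brute-force sketch checks the stated equality for small values; the function names are hypothetical and the enumeration is only practical for small $s$ and $k$.

```python
from itertools import combinations_with_replacement, product
from math import comb

def count_tuples(s, k):
    """Count k-tuples of non-negative integers whose entries sum to s."""
    return sum(1 for t in product(range(s + 1), repeat=k) if sum(t) == s)

def count_multisets(s, k):
    """Count multisets of cardinality s drawn from a set of size k."""
    return sum(1 for _ in combinations_with_replacement(range(k), s))

for s in range(1, 6):
    for k in range(1, 5):
        n_tuples = count_tuples(s, k)
        n_multisets = count_multisets(s, k)
        # Both counts should agree with the stars-and-bars closed form.
        print(s, k, n_tuples, n_multisets, comb(s + k - 1, s))
```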

Figures (14)

  • Figure 1: Doubly Stochastic Transformers. Standard scaled dot-product attention applies a Softmax activation on the query-key matrix (top). We study different techniques to make attention doubly stochastic by replacing the softmax operation (bottom). Our proposed Quantum Doubly Stochastic Transformer (QDSFormer) leverages QontOT, a variational quantum circuit with high expressivity.
  • Figure 2: Left: Number of unique DSMs obtained after exhaustively iterating over a discretized unit hypercube. With only $8$ layers, QontOT produces a unique DSM for every possible input, unlike all other methods. Right: Entropy vs. distance-preservation tradeoff. Shannon entropy of different doubly stochastic attention against the Frobenius norm of the difference between unnormalized attention $\boldsymbol{\mathrm{Q}}\boldsymbol{\mathrm{K}}$ and the obtained DSM $\boldsymbol{\mathrm{P}}$.
  • Figure 3: Comparison of ViT and QDSFormer while varying the circuit depth. Mean/std from 5 trainings are shown. Within (a) and (b) all models use the same number of trainable parameters.
  • Figure 4: (a) Eureka results across attention methods. (b) QDSFormer antedates the Eureka Moment (EM).
  • Figure A1: Hardware experiment on different quantum computers available via IBM Quantum Platform. A: The Frobenius distance between the hardware-obtained DSM and its noise-free equivalent. B: The Spearman rank correlation between the $64$ values in the noise-free and hardware-obtained DSMs shows that the ordering of values is largely preserved. Statevectorsampler here denotes finite sampling from an ideal, noise-free statevector.
  • ...and 9 more figures
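
The two quantities named in the Figure 2 and Figure A1 captions, Shannon entropy of a DSM and the Frobenius norm of a matrix difference, can be sketched as below. The captions do not say how the entropy is aggregated over the matrix, so averaging the row entropies (in nats) is an assumption made here, and the function names and toy matrices are hypothetical.

```python
import math

def mean_row_entropy(P):
    """Average Shannon entropy (in nats) over the rows of a stochastic matrix.

    Assumption: entropy is computed per row and averaged; zero entries
    contribute nothing, following the convention 0 * log(0) = 0.
    """
    total = 0.0
    for row in P:
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(P)

def frobenius_distance(A, B):
    """Frobenius norm of the elementwise difference A - B."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    )

# Two extreme 2x2 doubly stochastic matrices as toy inputs.
identity = [[1.0, 0.0], [0.0, 1.0]]  # minimal entropy: each row is one-hot
uniform = [[0.5, 0.5], [0.5, 0.5]]   # maximal entropy: each row is flat

print(mean_row_entropy(identity), mean_row_entropy(uniform))
print(frobenius_distance(identity, uniform))
```

The toy matrices mark the two ends of the tradeoff: the uniform DSM has the highest entropy but discards all structure of its input, whereas a permutation-like DSM keeps a sharp assignment at zero entropy.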

Theorems & Definitions (1)

  • Theorem D.1