Training-efficient density quantum machine learning

Brian Coyle; Snehal Raj; Natansh Mathur; El Amine Cherrat; Nishant Jain; Skander Kazdaghli; Iordanis Kerenidis

TL;DR

This work introduces density quantum neural networks (density QNNs), a framework that forms a density state $\rho(\boldsymbol{\theta},\boldsymbol{\alpha},\boldsymbol{x}) = \sum_{k=1}^K \alpha_k U_k(\boldsymbol{\theta}_k)\rho(\boldsymbol{x})U_k^{\dagger}(\boldsymbol{\theta}_k)$, enabling expressive yet trainable quantum models on depth-limited hardware. By incorporating data-dependent coefficients and two preparation modes (deterministic and randomised), the authors connect density QNNs to linear combination of unitaries (LCU) and the Hastings-Campbell Mixing lemma, showing that randomised density QNNs can retain LCU benefits with shallower circuits. The paper formally relates density QNNs to classical mechanisms (dropout, mixture-of-experts) and to other QML frameworks (kernel methods, data reuploading, post-variational models), and provides theoretical gradient-extraction bounds and efficient training strategies via commuting-block circuits. Numerical experiments across equivariant XX/YY, orthogonal Hamming weight (HW)-preserving, and data reuploading density QNNs demonstrate improved trainability, robustness to overfitting, and practical potential for scalable quantum learning on near-term devices.

Abstract

Quantum machine learning (QML) requires powerful, flexible and efficiently trainable models to be successful in solving challenging problems. We introduce density quantum neural networks, a model family that prepares mixtures of trainable unitaries, with a distributional constraint over coefficients. This framework balances expressivity and efficient trainability, especially on quantum hardware. For expressivity, the Hastings-Campbell Mixing lemma converts benefits from linear combination of unitaries into density models with similar performance guarantees but shallower circuits. For trainability, commuting-generator circuits enable density model construction with efficiently extractable gradients. The framework connects to various facets of QML including post-variational and measurement-based learning. In classical settings, density models naturally integrate the mixture of experts formalism, and offer natural overfitting mitigation. The framework is versatile - we uplift several quantum models into density versions to improve model performance, or trainability, or both. These include Hamming weight-preserving and equivariant models, among others. Extensive numerical experiments validate our findings.
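As a rough illustration of the mixture $\rho(\boldsymbol{\theta},\boldsymbol{\alpha},\boldsymbol{x}) = \sum_k \alpha_k U_k\rho(\boldsymbol{x})U_k^{\dagger}$, the NumPy sketch below builds the density state directly and compares the deterministic output $\Tr(\mathcal{O}\rho)$ with a randomised estimate that samples one sub-unitary index per draw with probability $\alpha_k$. Everything concrete here is an assumption made for illustration: the sub-unitaries are random (not trained) matrices, the input is $\ket{0\dots0}$, the observable is Pauli-$Z$ on the first qubit, and the randomised estimate uses exact per-sub-unitary expectation values rather than finite measurement shots.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unitary(dim, rng):
    """Stand-in for a trained sub-unitary U_k: Q factor of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

def density_qnn_state(alphas, unitaries, rho_x):
    """Return sum_k alpha_k U_k rho(x) U_k^dagger."""
    return sum(a * U @ rho_x @ U.conj().T for a, U in zip(alphas, unitaries))

n_qubits, K = 2, 3
dim = 2 ** n_qubits

# Assumed input state rho(x) = |0...0><0...0|
rho_x = np.zeros((dim, dim), dtype=complex)
rho_x[0, 0] = 1.0

# Mixture coefficients: non-negative and summing to one
alphas = np.array([0.5, 0.3, 0.2])
unitaries = [random_unitary(dim, rng) for _ in range(K)]

# Assumed observable: Pauli-Z on the first qubit
obs = np.kron(np.diag([1.0, -1.0]), np.eye(dim // 2))

# Deterministic mode: form the full mixed state, then measure
rho = density_qnn_state(alphas, unitaries, rho_x)
f_det = np.trace(obs @ rho).real

# By linearity, the same value is the alpha-weighted sum of sub-unitary outputs
f_k = np.array([np.trace(obs @ U @ rho_x @ U.conj().T).real for U in unitaries])
f_mix = float(alphas @ f_k)

# Randomised mode: sample an index k ~ alpha per draw and average the outputs
samples = rng.choice(K, size=20000, p=alphas)
f_rand = float(f_k[samples].mean())

print(f_det, f_mix, f_rand)
```

The deterministic and weighted-sum values agree to numerical precision, while the sampled value agrees only up to statistical error, which is the sense in which the randomised mode reproduces the density state on average.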

Paper Structure (59 sections, 7 theorems, 65 equations, 16 figures, 1 table)

This paper contains 59 sections, 7 theorems, 65 equations, 16 figures, 1 table.

Introduction
Results
Density quantum neural networks
Connection to other QML frameworks
Preparing density quantum neural networks
Gradient extraction for density QNNs
Efficiently trainable density networks
Hardware efficient quantum neural networks
LCU and the Mixing lemma
Connection to classical mechanisms
Quantum dropout
Density quantum neural networks as a mixture of experts
Numerical results
Equivariant quantum neural networks
Orthogonal quantum neural networks
...and 44 more sections

Key Result

Proposition 1

Given a density QNN as in equation (\ref{eqn:density_qnns}) composed of $K$ sub-unitaries, $\mathcal{U} = \{U_k(\boldsymbol{\theta}_k)\}_{k=1}^K$, implemented with distribution $\boldsymbol{\alpha} = \{\alpha_k\}$, an unbiased estimator of the gradients of a loss function, $\mathcal{L}$, defined by a Hermitian observable, can be computed by classically post-processing $\sum_{\ell=1}^K\sum_{k=1}^K T_{\ell k}$ circuits, where ...
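The simplest instance of this kind of result is the case where no parameters are shared between sub-unitaries: the model output is $\sum_k \alpha_k f_k(\boldsymbol{\theta}_k)$, so the derivative with respect to $\boldsymbol{\theta}_k$ is just $\alpha_k$ times the derivative of the $k$-th sub-unitary alone. The sketch below checks this on a toy single-qubit model. The specific choices are assumptions for illustration only: two sub-unitaries $e^{-i\theta_1 X/2}$ and $e^{-i\theta_2 Y/2}$, input $\ket{0}$, observable $Z$, and the standard two-term parameter-shift rule for Pauli rotations. It differentiates the model output rather than a full loss function.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def rot(pauli, theta):
    """Pauli rotation exp(-i * theta * pauli / 2)."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * pauli

def sub_expectation(pauli, theta, rho_x, obs):
    """f_k(theta) = Tr(obs U_k rho(x) U_k^dagger) for one sub-unitary."""
    U = rot(pauli, theta)
    return np.trace(obs @ U @ rho_x @ U.conj().T).real

def density_gradient(alphas, paulis, thetas, rho_x, obs):
    """Gradient of sum_k alpha_k f_k(theta_k) with independent parameters.

    Each entry uses two shifted evaluations of a single sub-unitary,
    so the total number of evaluations is 2K in this toy model.
    """
    shift = np.pi / 2
    grads = []
    for a, P, t in zip(alphas, paulis, thetas):
        plus = sub_expectation(P, t + shift, rho_x, obs)
        minus = sub_expectation(P, t - shift, rho_x, obs)
        grads.append(a * 0.5 * (plus - minus))
    return np.array(grads)

# Assumed toy setup: input |0><0|, observable Z, two sub-unitaries
rho_x = np.array([[1, 0], [0, 0]], dtype=complex)
alphas = np.array([0.7, 0.3])
paulis = [X, Y]
thetas = np.array([0.4, 1.1])

grad = density_gradient(alphas, paulis, thetas, rho_x, Z)

# For this setup f_k(theta) = cos(theta), so the exact gradient is -alpha_k sin(theta_k)
analytic = -alphas * np.sin(thetas)
print(grad, analytic)
```

The number of shifted evaluations grows linearly with the number of sub-unitaries here, which mirrors the $\mathcal{O}(K)$ overhead described for the independent-parameter case.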

Figures (16)

Figure 1: Density quantum neural networks. a) Linear combination of unitaries quantum neural networks (LCU QNNs) prepare the state $\sum_k\alpha_k U_k(\boldsymbol{\theta}_k)\ket{\boldsymbol{x}}$ via postselection on an ancilla register $\mathcal{A}$, which prepares the distribution $\boldsymbol{\alpha}$. b) The corresponding density quantum neural network, implemented deterministically to prepare the state $\rho(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x})$. c) The instantiation of the density QNN state via randomisation, where the sub-unitary $U_k(\boldsymbol{\theta}_k)$ is prepared only with probability $\alpha_k$, without the need for multi-controlled deep circuits and ancilla qubits. The deterministic density QNN (b) is required if one wishes to make a true comparison of these networks to the dropout mechanism. From the Mixing lemma, the randomised version (c) can distill the performance benefits of the more powerful LCU QNN (a) into very short depth circuits. The probability loaders, $\mathsf{Load}\left(\sqrt{\boldsymbol{\alpha}}\right)$, are assumed to be unary data loaders which act on $K$ qubits within the register $\mathcal{A}$ and have depth $\log(K)$ \cite{johri_nearest_2021}. One could also use binary Prepare and Select circuits acting on $\log(K)$ qubits, as is more standard in the LCU literature. The function produced by each network, $f(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x})$, results from the measurement of an observable $\mathcal{O}$ via $f(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x}) = \Tr(\mathcal{O}\sigma(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x}))$, where $\sigma$ is the output state from each circuit. $V(\boldsymbol{x})$ is the $n$-qubit data loader acting on register $\mathcal{B}$.
Figure 2: Illustration of Corollary \ref{corr:density_qnn_gradient_independent}. In the case where no parameters are shared across the sub-unitaries, computing the gradients of the density model in equation (\ref{eqn:density_qnns}), when measured with an observable $\mathcal{H}$, simply involves computing gradients for each sub-unitary individually. As a result, the full model introduces an $\mathcal{O}(K)$ overhead for gradient extraction. If $K=\mathcal{O}(\log(N))$ and each sub-unitary admits a backpropagation scaling for gradient extraction, the density model will also admit a backpropagation scaling.
Figure 3: Decomposing a hardware efficient ansatz for a density QNN. $D$ layers of a hardware efficient (HWE) ansatz with entanglement generated by CNOT ladders and trainable parameters in single-qubit $R_x, R_y, R_z$ gates. (bottom left) $D$ layers extracted into $D$ sub-unitaries with probabilities $\{\alpha_d\}_{d=1}^D$ for a density QNN version. Applying the commuting-generator framework to the density version, $\rho^{\mathsf{HWE}}(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x})$, enables parallel gradient evaluation in $2D$ circuits, versus the $2nD$ required by the pure state version, $\ket{\psi^{\mathsf{HWE}}(\boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{x})}$. To illustrate potential differences between sub-unitaries, we arbitrarily reverse CNOT directions in subsequent layers, partially accounting for low circuit depth. (bottom right) Alternatively, we can simply create a more expressive version of the hardware efficient QNN within the density framework by duplicating across $K$ sub-unitaries with probabilities $\{\alpha_k\}_{k=1}^K$, retaining $D$ layers each. In this case, the model requires $2nDK$ circuits for gradient extraction, but each sub-unitary can have independent parameters learning different features, especially if each contains different entanglement structures.
Figure 4: Equivariant density QNNs. a) The commuting-generator XX model and b) an XX+YY density QNN model. The former contains up to three-body Pauli-$X$ generated operations with twirling applied to enforce equivariance. The latter contains two sub-unitaries, $U_{XX}$ (circuit (a)) and $U_{YY}$, which has the same structure but with Pauli-$X$ operations replaced by Pauli-$Y$. $U_{XX}/U_{YY}$ are applied with probabilities $\alpha_{XX/YY}$. Each sub-unitary in b) is a commuting-generator circuit, so each has efficiently extractable gradients.
Figure 5: Numerics on noisy bars and dots dataset. We create density QNN versions of the non-trivial models from Ref. \cite{bowles_backpropagation_2023}: 1) the commuting-generator circuit in Fig. \ref{fig:equivariant_models}, 2) a 'non-commuting' equivariant QNN and 3) the quantum convolutional neural network \cite{cong_quantum_2019}, all on $10$ qubits. In all cases the density QNN is initialised from separately pretrained (for $1000$ epochs) base versions, and training continues for another $1000$ epochs. For the non-commuting and QCNN density models the sub-unitaries have identical structure but are trained independently. We show mean and standard deviation in test accuracy vs. a) training epochs and b) number of overall shots over $5$ independent training runs, from the same initialisation. In all cases, after base model performance saturation, the density QNN improves the final result. The gaps in (b) account for the extra measurement overhead to initialise and train the second sub-unitaries. This is $U_{YY}$ for the density model and $U_2$ (non-comm-2/QCNN-2) for the other two models. In all cases, the density version is initialised with $\alpha_1/\alpha_{XX} = 0.99$ and $\alpha_2/\alpha_{YY} = 0.01$, which are also trainable.
...and 11 more figures
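The circuit counts quoted in the Figure 3 caption can be tabulated with a few lines of arithmetic. The helper names and the example values of $n$, $D$ and $K$ below are hypothetical and chosen only for illustration; the formulas themselves ($2nD$, $2D$ and $2nDK$) are taken from the caption.

```python
def pure_hwe_circuits(n: int, D: int) -> int:
    """Gradient circuits for the pure-state HWE ansatz: 2nD."""
    return 2 * n * D

def density_hwe_circuits(D: int) -> int:
    """Gradient circuits for the density version with D commuting-generator sub-unitaries: 2D."""
    return 2 * D

def duplicated_density_hwe_circuits(n: int, D: int, K: int) -> int:
    """Gradient circuits when K sub-unitaries each retain D layers: 2nDK."""
    return 2 * n * D * K

# Hypothetical example sizes
n, D, K = 10, 5, 3
counts = {
    "pure HWE (2nD)": pure_hwe_circuits(n, D),
    "density HWE (2D)": density_hwe_circuits(D),
    "duplicated density HWE (2nDK)": duplicated_density_hwe_circuits(n, D, K),
}
for name, value in counts.items():
    print(f"{name}: {value}")
```

For these example sizes the layer-wise density version needs a factor of $n$ fewer gradient circuits than the pure-state ansatz, while the duplicated version trades a factor of $K$ more circuits for independently parameterised sub-unitaries.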

Theorems & Definitions (12)

Proposition 1: Gradient scaling for density quantum neural networks
Corollary 1
Definition 1: Backpropagation scaling \cite{abbas_quantum_2023, bowles_backpropagation_2023}
Corollary 2: Gradient scaling for density commuting-block quantum neural networks
proof
Lemma 1: Mixing lemma for supervised learning
Proposition: Gradient scaling for density quantum neural networks (Proposition \ref{prop:gradient_scaling_dqnn} repeated)
proof
Lemma: Mixing lemma for supervised learning (Lemma \ref{lemma:supervised_mixing_corr} repeated)
proof
...and 2 more
