Guided Quantum Compression for High Dimensional Data Classification

Vasilis Belis, Patrick Odagiu, Michele Grossi, Florentin Reiter, Günther Dissertori, Sofia Vallecorsa

TL;DR

The paper addresses the challenge of applying quantum machine learning to high-dimensional LHC data by introducing Guided Quantum Compression (GQC), a hybrid architecture that simultaneously learns a low-dimensional latent representation and a quantum classifier. By coupling an auto-encoder with a variational quantum circuit and training them with a joint loss, GQC preserves discriminative structure that separate preprocessing can destroy. On simulated $t\bar{t}H(b\bar{b})$ data, GQC outperforms the conventional 2Step approach and is competitive with classical baselines, with markedly better latent-space separability when using limited feature sets. This work broadens the practical applicability of QML to realistic physics datasets and provides public data and code to foster further development.
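As a rough illustration of what training with a joint loss can look like, the sketch below combines a reconstruction term and a classification term into one scalar. The weighted-sum form, the weight `lam`, the use of mean squared error and binary cross-entropy, and all function names are assumptions made for this example; the summary above does not specify the exact form used in the paper.

```python
import numpy as np


def reconstruction_loss(x, x_hat):
    """Mean squared error between inputs and auto-encoder reconstructions."""
    return float(np.mean((x - x_hat) ** 2))


def classification_loss(y, p, eps=1e-12):
    """Binary cross-entropy between labels y in {0, 1} and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def joint_loss(x, x_hat, y, p, lam=0.5):
    """Assumed form: reconstruction term plus lam times the classification term."""
    return reconstruction_loss(x, x_hat) + lam * classification_loss(y, p)


# Toy batch: 4 events with 3 features each (random placeholders, not physics data).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
x_hat = x + 0.1                      # stand-in for the decoder output
y = np.array([1, 0, 1, 0])           # signal / background labels
p = np.array([0.9, 0.2, 0.7, 0.4])   # stand-in for the quantum classifier output

total = joint_loss(x, x_hat, y, p, lam=0.5)
```

Because both terms depend on the encoder output, minimizing a single scalar of this kind lets the gradient from the classifier shape the latent space, which is the idea behind the word "guided".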

Abstract

Quantum machine learning provides a fundamentally different approach to analyzing data. However, many interesting datasets are too complex for currently available quantum computers. Present quantum machine learning applications usually diminish this complexity by reducing the dimensionality of the data, e.g., via auto-encoders, before passing it through the quantum models. Here, we design a classical-quantum paradigm that unifies the dimensionality reduction task with a quantum classification model into a single architecture: the guided quantum compression model. We exemplify how this architecture outperforms conventional quantum machine learning approaches on a challenging binary classification problem: identifying the Higgs boson in proton-proton collisions at the LHC. Furthermore, the guided quantum compression model shows better performance compared to the deep learning benchmark when using solely the kinematic variables in our dataset.

Paper Structure (12 sections, 10 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The GQC network. Panels a) and b) show the architecture of the Guided Quantum Compression (GQC) network. The auto-encoder receives data from simulated LHC proton-proton collisions and produces a lower-dimensional representation $z\in\mathbb{R}^\ell$ via the encoder network $\mathcal{E}_\omega$, where $\ell$ is the latent space dimension. The decoder network $\mathcal{D}_\rho$ receives $z$ and aims to reconstruct the original data $x$. The distinct segments $z_1,\dots,z_i,\dots,z_d$ of the latent vector $z$ are encoded sequentially in the quantum circuit using the feature map $U(\cdot)$; the dimension of each $z_i$ equals the number of qubits $n$ in the circuit. The trainable gates $G(\cdot)$ are placed between the quantum encoding gates $U(\cdot)$. The outputs of the decoder network and the quantum model are used to minimize different parts of the total loss function $\mathcal{L}$ from Eq. \ref{eq:gq_loss}. c) The data encoding circuit $U(\cdot)$ \cite{Havlicek2019}, described in Sec. \ref{sec:vqc}. d) The variational ansatz $G(\cdot)$. The indices $j=1,\dots,n$ enumerate the elements of $z_i$, and the indices $l=1,\dots,2nr$ enumerate the trainable parameters of the $k$-th parametrised circuit block $G(\vartheta_k)$, where $r$ is the number of repetitions of the trainable ansatz.
  • Figure 2: Latent representation. The one- and two-dimensional projections of the $t\bar{t}H(b\bar{b})$ dataset latent space $z\in\mathbb{R}^\ell$ generated by a) the 2Step training paradigm and b) the GQC model. The histograms show the probability distributions of the latent features $z_7$ and $z_{10}$, and the density plots show the joint two-dimensional probability distributions $\mathcal{P}(z_7, z_{10})$. Signal and background are better separated in the GQC latent space, and the GQC latent distributions are more regularly shaped. The latent features $z_7$ and $z_{10}$ are chosen arbitrarily to show the structure of the latent vector $z$ in one or two dimensions. The joint distributions are symmetric: $\mathcal{P}(z_7,z_{10})=\mathcal{P}(z_{10},z_{7})$.
  • Figure 3: Receiver operating characteristic (ROC) curves. The ROC curves of the models trained a) with and b) without the btag features in the training data. The 2Step training procedure yields the worst performance. The bottom panel of each plot displays the difference between the GQC ROC and the classical ROC. When the btag features are absent from the dataset, the GQC model outperforms the classical benchmark in the TPR range of $0.4$ to $0.9$, as shown in the lower panel of b).
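
The sequential encoding described in the Figure 1 caption can be sketched as simple bookkeeping: split the latent vector $z$ into $d$ segments of $n$ features each, then alternate encoding blocks $U(z_k)$ with trainable blocks $G(\vartheta_k)$ holding $2nr$ parameters each. The code below only builds this gate layout and does not simulate a quantum circuit. The values $\ell=16$, $n=4$, $r=2$, the function names, and the choice to place one $G$ block after every $U$ block are assumptions for illustration, not details confirmed by the caption.

```python
import numpy as np


def split_latent(z, n_qubits):
    """Split a latent vector of length ell into d = ell / n_qubits segments."""
    z = np.asarray(z, dtype=float)
    if z.size % n_qubits != 0:
        raise ValueError("latent dimension must be a multiple of the number of qubits")
    return z.reshape(-1, n_qubits)


def circuit_layout(z, n_qubits, reps, rng):
    """Return the gate sequence as (name, values) pairs.

    Assumption: one trainable block G follows each encoding block U.
    """
    layout = []
    for k, z_k in enumerate(split_latent(z, n_qubits), start=1):
        # Each trainable block carries 2 * n * r parameters, as in the caption.
        theta_k = rng.uniform(0.0, 2.0 * np.pi, size=2 * n_qubits * reps)
        layout.append((f"U(z_{k})", z_k))
        layout.append((f"G(theta_{k})", theta_k))
    return layout


rng = np.random.default_rng(1)
z = np.arange(16, dtype=float)  # hypothetical latent vector with ell = 16
layout = circuit_layout(z, n_qubits=4, reps=2, rng=rng)

names = [name for name, _ in layout]
n_trainable = sum(v.size for name, v in layout if name.startswith("G"))
```

With these illustrative values the layout has $d=4$ encoding blocks and $4 \times 2nr = 64$ trainable angles, while the circuit width stays fixed at $n=4$ qubits regardless of $\ell$.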