Encoding molecular structures in quantum machine learning

Choy Boy, Edoardo Altamura, Dilhan Manawadu, Ivano Tavernelli, Stefano Mensa, David J. Wales

TL;DR

QMSE introduces a bond-order–aware encoding of molecular structures into quantum circuits via a hybrid Coulomb–adjacency matrix, addressing limitations of fingerprint encodings in state separability and trainability. By mapping the matrix to one- and two-qubit rotations, QMSE yields more expressive and interpretable feature maps, with a fidelity-preserving chain-contraction theorem enabling qubit reductions for long chains. Benchmarking on 105 small molecules shows QMSE outperforms angle-based fingerprint encoding in classification and regression tasks, and experiments demonstrate robustness to hardware noise and favourable training dynamics. The work suggests QMSE as a practical pathway toward scalable quantum-assisted modelling of chemical data and a bridge to future FTQC-enabled graph-state encodings and kernel methods.

Abstract

Quantum machine learning (QML) has great potential for the analysis of chemical datasets. However, conventional quantum data-encoding schemes, such as fingerprint encoding, are generally ill-suited to representing the chemical moieties in such datasets accurately. In this contribution, we introduce the quantum molecular structure encoding (QMSE) scheme, which encodes molecular bond orders and interatomic couplings, expressed as a hybrid Coulomb-adjacency matrix, directly as one- and two-qubit rotations within parameterised circuits. We show that this strategy provides an efficient and interpretable way to improve state separability between encoded molecules compared with fingerprint encoding methods, which is crucial for preparing effective feature maps in QML workflows. To benchmark our method, we train a parameterised ansatz on molecular datasets to perform classification of state phases and regression on boiling points, demonstrating the competitive trainability and generalisation capabilities of QMSE. We further prove a fidelity-preserving chain-contraction theorem that reuses common substructures to cut qubit counts, with an application to long-chain fatty acids. We expect this scalable and interpretable encoding framework to pave the way for practical QML applications to molecular datasets.
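To make the encoding idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one qubit per heavy atom, diagonal matrix entries used as $R_y$ angles, and nonzero off-diagonal entries used as $R_{xx}$ angles. The matrix values, the gate ordering, and the function names (`apply_ry`, `apply_rxx`, `encode`, `fidelity`) are illustrative assumptions; the paper's actual hybrid Coulomb-adjacency matrix is defined in its own equations and is not reproduced here.

```python
import math

def apply_ry(state, q, theta):
    """Apply R_y(theta) to qubit q of a statevector stored as a list."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    out = list(state)
    bit = 1 << q
    for i in range(len(state)):
        if not i & bit:          # i has qubit q in |0>, j is its |1> partner
            j = i | bit
            out[i] = c * state[i] - s * state[j]
            out[j] = s * state[i] + c * state[j]
    return out

def apply_rxx(state, q1, q2, theta):
    """Apply R_xx(theta) = cos(theta/2) I - i sin(theta/2) X(x)X to (q1, q2)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    mask = (1 << q1) | (1 << q2)  # X(x)X flips both qubits
    return [c * state[i] - 1j * s * state[i ^ mask] for i in range(len(state))]

def encode(matrix, layers=1):
    """Encode a symmetric matrix: diagonal -> R_y, off-diagonal -> R_xx."""
    n = len(matrix)
    state = [0j] * (2 ** n)
    state[0] = 1 + 0j             # start in |0...0>
    for _ in range(layers):       # 'layers' plays the role of L_x
        for q in range(n):
            state = apply_ry(state, q, matrix[q][q])
        for a in range(n):
            for b in range(a + 1, n):
                if matrix[a][b] != 0:
                    state = apply_rxx(state, a, b, matrix[a][b])
    return state

def fidelity(psi, phi):
    """State fidelity |<psi|phi>|^2 between two pure states."""
    overlap = sum(x.conjugate() * y for x, y in zip(psi, phi))
    return abs(overlap) ** 2

# Placeholder angles (assumptions): 0.6 per carbon atom, 0.5 per unit bond order.
but_2_ene = [[0.6, 0.5, 0.0, 0.0],
             [0.5, 0.6, 1.0, 0.0],   # C2=C3 double bond -> larger angle
             [0.0, 1.0, 0.6, 0.5],
             [0.0, 0.0, 0.5, 0.6]]
butane = [[0.6, 0.5, 0.0, 0.0],
          [0.5, 0.6, 0.5, 0.0],      # all single bonds
          [0.0, 0.5, 0.6, 0.5],
          [0.0, 0.0, 0.5, 0.6]]

f_cross = fidelity(encode(but_2_ene), encode(butane))
f_self = fidelity(encode(but_2_ene), encode(but_2_ene))
print(f"self fidelity: {f_self:.4f}, butene/butane fidelity: {f_cross:.4f}")
```

Under these assumptions, changing a single bond order changes one two-qubit rotation angle, so the two encoded states are no longer identical and their fidelity drops below one. This is the sense in which a bond-order-aware encoding can separate structurally similar molecules.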

Paper Structure

This paper contains 15 sections, 2 theorems, 9 equations, 10 figures, 2 tables.

Key Result

Theorem 1

There exists a bijection $\phi:\mathcal{S}\rightarrow\mathcal{Q}$ between the set of SMILES strings $\mathcal{S}$ and the set of QMSE unitaries $\mathcal{Q}$ for a given $L_\mathbf{x}$ and $N$.

Figures (10)

  • Figure 1: Schematic of the variational QML workflow for two encoding strategies. a) Fingerprint (angle) encoding: compressed molecular fingerprints are loaded onto the data-encoding layer (in green) as angular rotations. After the state evolves under a parameterised ansatz with unitary operator $U(\bm{\theta})$, the circuit's expectation values are evaluated by measuring an observable $\hat{H}$, and the resulting cost function $C(\bm{\theta})$ is fed back to a classical optimiser, which updates the ansatz parameters until either the maximum number of iterations or the convergence criterion is reached. b) Quantum molecular structure encoding: SMILES strings are instead first converted into a hybrid Coulomb-adjacency matrix and encoded into the quantum circuit by a dedicated data-encoding layer; the ansatz, measurement of $\hat{H}$, and classical parameter-update loop are then applied as in a).
  • Figure 2: Example molecular encoding layer (bottom) forming part of a 4-qubit quantum circuit representing an (E)-but-2-ene molecule (top) using $R_y$ and $R_{xx}$ gates with $L_{\mathbf{x}}$ data-encoding layers. The rotation angles of the gates are set by the elements of the hybrid Coulomb-adjacency matrix in Eq. \ref{eq:main-matrix}.
  • Figure 3: Tanimoto similarity (left) and QMSE fidelity (right) of the fatty acid series FA1–FA7. The chemical overlap is computed from SMILES strings encoded via QMSE, using the default $R_y + R_{xx}$ combination. For each overlap pair, the number of qubits of the unitary circuits is also reported after chain contraction.
  • Figure 4: Heatmaps of fidelity matrices for chain-contracted molecules encoded with QMSE within the alkane- (top row) and oxygen- (bottom row) subdatasets. The one-qubit data-encoding gate for all configurations is fixed as $R_y$, while the two-qubit data-encoding gate is varied to produce the fidelity matrix for $R_{xx}$ (left column), $R_{yy}$ (middle column), and $R_{zz}$ (right column). The colour corresponds to the fidelity of each molecule pair within their respective insets. Based on the overall distribution of the fidelity values, $R_{y}+ R_{xx}$ is selected as the default setup for QMSE.
  • Figure 5: Accuracy scores when classifying molecules in the alkane (left, a and b) and complete (right, c and d) datasets for Runs 1–9 with $L_{\theta}$ = 1–5 ansatz layers. The top row shows the median training accuracy scores, and the bottom row shows the median test accuracy scores. The error bars indicate the 16th and 84th percentile values of the average accuracies of the $k$-fold splits.
  • ...and 5 more figures
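
For reference, the Tanimoto similarity that Figure 3 compares against QMSE fidelity is the ratio of shared to total "on" bits of two fingerprints, $T(A, B) = |A \cap B| / |A \cup B|$. A minimal sketch follows; the bit positions are made-up placeholders rather than real fingerprints of the FA1–FA7 molecules, and the function name `tanimoto` is illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # choice made here: two empty fingerprints count as identical
    return len(a & b) / len(union)

# Hypothetical on-bits for a shorter and a longer chain sharing most substructures.
short_chain = {1, 4, 7, 9}
long_chain = {1, 4, 7, 9, 12}

similarity = tanimoto(short_chain, long_chain)  # 4 shared bits / 5 total bits
print(similarity)
```

Because homologous chains share almost all of their substructure bits, a set-based similarity of this kind can stay close to one along a series, which is the kind of behaviour the fidelity comparison in Figure 3 is meant to contrast.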

Theorems & Definitions (4)

  • Theorem
  • proof
  • proof
  • Corollary