Efficient interpolation of molecular properties across chemical compound space with low-dimensional descriptors

Yun-Wen Mao; Roman V. Krems

TL;DR

The study addresses the prediction of molecular properties across chemical compound space when labeled data are scarce. It introduces nine-dimensional descriptors that combine three features of the distribution of Coulomb-matrix eigenvalues with six features informed by the Gershgorin circle theorem, and uses them in Gaussian process regression with kernels of variable functional form, selected by a greedy, BIC-guided search. Models trained with 100 molecules predict $S \times T$ (entropy × temperature) and ZPVE with an absolute error under 1 kcal mol$^{-1}$ for more than 78 % of a test set of 20,000 molecules, and the Gershgorin-based descriptors yield more accurate entropy models than descriptors built from Graph Convolution Network outputs. This data efficiency makes the framework a candidate for Bayesian optimization in chemical space and reduces reliance on large labeled datasets or expensive quantum calculations.

Abstract

We demonstrate accurate data-starved models of molecular properties for interpolation in chemical compound spaces with low-dimensional descriptors. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of Coulomb matrices. To account for the shape and composition of molecules, we combine these descriptors with six-dimensional features informed by the Gershgorin circle theorem. We use the nine-dimensional descriptors thus obtained for Gaussian process regression based on kernels with variable functional form, leading to extremely efficient, low-dimensional interpolation models. The resulting models trained with 100 molecules are able to predict the product of entropy and temperature ($S \times T$) and zero point vibrational energy (ZPVE) with the absolute error under 1 kcal mol$^{-1}$ for $> 78$ \% and under 1.3 kcal mol$^{-1}$ for $> 92$ \% of molecules in the test data. The test data comprises 20,000 molecules with complexity varying from three atoms to 29 atoms and the ranges of $S \times T$ and ZPVE covering 36 kcal mol$^{-1}$ and 161 kcal mol$^{-1}$, respectively. We also illustrate that the descriptors based on the Gershgorin circle theorem yield more accurate models of molecular entropy than those based on graph neural networks that explicitly account for the atomic connectivity of molecules.
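
As an illustration of the two matrix-based ingredients named in the abstract, the sketch below builds a Coulomb matrix, extracts three eigenvalue statistics, and computes Gershgorin discs. It is a minimal sketch under stated assumptions, not the authors' implementation: the Coulomb matrix uses the common convention $M_{ii} = 0.5 Z_i^{2.4}$ and $M_{ij} = Z_i Z_j / |R_i - R_j|$ in atomic units; the three statistics follow the labels in the Fig. 5 caption ($\epsilon_{\rm max}$, $\mu(\epsilon>1)$, $\sigma(\epsilon>1)$), read here as the largest eigenvalue and the mean and standard deviation of eigenvalues above 1; the zero fallback for an empty selection, the function names, and the water geometry are hypothetical. The six Gershgorin-informed features of the paper are not reproduced; only the discs of the theorem itself are computed.

```python
import numpy as np

ANGSTROM_TO_BOHR = 1.8897259886  # unit conversion; atomic units are an assumption


def coulomb_matrix(Z, R):
    """Coulomb matrix for nuclear charges Z and positions R (in Bohr)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)          # placeholder, avoids division by zero
    M = np.outer(Z, Z) / dist            # off-diagonal: Z_i Z_j / |R_i - R_j|
    np.fill_diagonal(M, 0.5 * Z ** 2.4)  # diagonal: 0.5 Z_i^2.4
    return M


def cm_descriptor(M, threshold=1.0):
    """Return [eps_max, mean(eps > threshold), std(eps > threshold)]."""
    eps = np.linalg.eigvalsh(M)          # M is real and symmetric
    selected = eps[eps > threshold]
    if selected.size == 0:               # assumption: fall back to zeros
        return np.array([eps.max(), 0.0, 0.0])
    return np.array([eps.max(), selected.mean(), selected.std()])


def gershgorin_discs(M):
    """Disc centers M_ii and radii sum_{j != i} |M_ij| for each row i."""
    centers = np.diag(M).copy()
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return centers, radii


if __name__ == "__main__":
    # Approximate water geometry in Angstrom, for illustration only.
    Z = [8, 1, 1]
    R = np.array([[0.0, 0.0, 0.1173],
                  [0.0, 0.7572, -0.4692],
                  [0.0, -0.7572, -0.4692]]) * ANGSTROM_TO_BOHR
    M = coulomb_matrix(Z, R)
    print("descriptor:", cm_descriptor(M))
    print("Gershgorin centers and radii:", gershgorin_discs(M))
```

By the Gershgorin circle theorem, every eigenvalue of the matrix lies in at least one of the discs returned by `gershgorin_discs`, which is why the discs carry information about the eigenvalue distribution without a diagonalization.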

Paper Structure (12 sections, 18 equations, 12 figures, 1 table)

Introduction
Dataset summary
Method
Descriptor I: Distributions of eigenvalues of Coulomb matrices
Descriptor II: Graph Convolution Networks
Descriptor III: Gershgorin circle theorem
ML method: Gaussian Process Regression with variable kernels
Results
Analysis of different low-dimensional descriptors
Data-efficient models of entropy
Zero-point Vibrational Energy
Summary

Figures (12)

Figure 1: Schematic diagram illustrating the use of GCN to create a molecular descriptor ($H^4_{\rm{C_2H_4O}}$) with C$_2$H$_4$O as an example. The GCN used in this work, as demonstrated in this figure, has four layers ($\lambda +1 = 4$). The resulting molecular descriptor is nine-dimensional, because the largest molecule in the QM9 dataset considered here has 9 heavy atoms.
Figure 2: Schematic diagram of the algorithm used in the present work for training GCN by optimizing weight matrices ($W_1$, $W_2$, $W_3$). Each molecule in A is processed as described in Fig. 1 to yield a unique GCN vector ($H^4_{\rm{molecule}}$), used as the molecular descriptor. A subset of molecules in A, yielding a set of vectors depicted in C, is used to train a Gaussian process (GP) model of entropy. The RMSE of the GP-predicted values is used in step E as the loss function to optimize the weight matrices ($W_1$, $W_2$, $W_3$) of GCN by gradient descent.
Figure 3: Left: Molecular probability density function for two isomers of ${\rm C_{7}H_{10}O_{2}}$ based on the Gershgorin circle theorem. The black dotted curves show the atomic reference curves (HCNOF). Right: Molecular probability density function for an isomer of ${\rm C_{7}H_{10}O_{2}}$ and the reference molecule based on the Gershgorin circle theorem. For all $f_{\rm C_{7}H_{10}O_{2}}$ and $f_{\rm ref}$, $d_i = M_{ii}$, and $\tau = 1$.
Figure 4: Schematic diagram of the kernel construction algorithm used in this work to increase the complexity of the kernel function for optimal kernel models.
Figure 5: Test MAE of GP regression of entropy trained with different combinations of molecular descriptors: circles -- 3D CM($\epsilon_{\rm max}$, $\mu(\epsilon>1)$, $\sigma(\epsilon>1)$) with $\rm{RQ+RQ}$ kernel; filled diamonds -- CM + GCN-PC($k = 5$) with $\rm{RQ+RQ}$ kernel; empty diamonds -- CM + GCN-PC($k = 9$) with $\rm{RQ+RQ}$ kernel; up-triangles -- the 4D $\rm{CM}+\rm{AUC}[f_{\rm{molecule}}]$ descriptor with $\rm{RQ + RQ + MAT}$ kernel; down-triangles -- the 9D $\rm{CM}+\rm{AUC}[f_{\rm{molecule}}]+\langle f_{\rm HCNOF}, f_{\rm molecule} \rangle$ descriptor with $\rm{RQ + MAT \times DP}$ kernel. The kernel of each model is optimized as described in the text with BIC as the model selection metric. The lines represent the average over ten training runs with different selections of molecules in the training set, and the shaded areas show the standard deviation of the results.
...and 7 more figures
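
The captions of Figs. 4 and 5 refer to kernels such as $\rm{RQ + MAT \times DP}$ that are built up incrementally with BIC as the model selection metric. The sketch below shows one generic way such a greedy search can be written with scikit-learn; it is an assumption-laden illustration, not the authors' algorithm. The base-kernel set, the added noise term, the stopping rule, and the function names `bic_of` and `greedy_kernel_search` are all hypothetical. BIC is taken as $-2 \ln \hat{L} + k \ln n$, with $\hat{L}$ the maximized marginal likelihood, $k$ the number of kernel hyperparameters, and $n$ the number of training points.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, RationalQuadratic, Matern, DotProduct, WhiteKernel,
)


def bic_of(kernel, X, y):
    """Fit a GP with `kernel` plus a noise term and return its BIC."""
    gp = GaussianProcessRegressor(kernel=kernel + WhiteKernel(1e-2),
                                  normalize_y=True)
    try:
        gp.fit(X, y)  # maximizes the log marginal likelihood
    except np.linalg.LinAlgError:
        return np.inf  # treat a numerically failed fit as a rejected model
    n_params = gp.kernel_.theta.size
    return -2.0 * gp.log_marginal_likelihood_value_ + n_params * np.log(len(y))


def greedy_kernel_search(X, y, max_terms=3):
    """Grow a kernel by sums and products while the BIC keeps decreasing."""
    base = {"RBF": RBF(), "RQ": RationalQuadratic(),
            "MAT": Matern(nu=2.5), "DP": DotProduct()}
    scores = {name: bic_of(k, X, y) for name, k in base.items()}
    label = min(scores, key=scores.get)
    best_kernel, best_bic = base[label], scores[label]
    for _ in range(max_terms - 1):
        candidates = []
        for name, k in base.items():
            candidates.append((f"({label} + {name})", best_kernel + k))
            candidates.append((f"({label} x {name})", best_kernel * k))
        scored = [(bic_of(k, X, y), lab, k) for lab, k in candidates]
        bic, lab, k = min(scored, key=lambda t: t[0])
        if bic >= best_bic:  # no candidate improves the BIC: stop
            break
        best_bic, best_kernel, label = bic, k, lab
    return label, best_kernel, best_bic


if __name__ == "__main__":
    # Synthetic two-dimensional data, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(40, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(40)
    label, kernel, bic = greedy_kernel_search(X, y)
    print(label, bic)
```

Because the search starts from the best single kernel and accepts a more complex kernel only when the BIC decreases, the penalty term $k \ln n$ keeps the kernel from growing when the added hyperparameters do not pay for themselves, which matters most for small training sets such as the 100-molecule sets discussed above.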
