Bayesian Optimization in Chemical Compound Sub-Spaces using Low-Dimensional Molecular Descriptors

Yun-Wen Mao, Roman V. Krems

TL;DR

A Bayesian optimization framework that identifies optimal molecular structures with high precision using fewer than 2,000 training data points within a chemical subspace containing more than 133,000 molecules. It employs a low-dimensional, physics-informed molecular descriptor vector that facilitates data-efficient surrogate modelling and optimization.

Abstract

Efficient optimization of molecules with targeted properties remains a significant challenge due to the vast size and discrete nature of chemical compound space. Conventional machine-learning-based optimization approaches typically require large datasets to construct accurate surrogate models, limiting their applicability in data-scarce settings. In this study, we present a Bayesian optimization (BO) framework that identifies optimal molecular structures with high precision using fewer than 2,000 training data points within a chemical subspace containing more than 133,000 molecules. The framework employs a low-dimensional and physics-informed molecular descriptor vector that facilitates data-efficient surrogate modelling and optimization. A key innovation of the proposed framework is a reliable inverse mapping scheme that translates optimized points in the descriptor space back into chemically valid molecular structures, thereby bridging continuous optimization and discrete molecular design. We demonstrate the effectiveness of our approach on the QM9 benchmark dataset, where the framework successfully identifies organic molecules with the target entropy and zero-point vibrational energy (ZPVE) values. For entropy optimization, our approach achieves a 100% success rate while requiring fewer than 1,000 molecular evaluations in more than 80% of test cases. For ZPVE, the success rate exceeds 80% for molecules containing more than two heavy atoms. These results highlight the critical role of low-dimensional, interpretable descriptors in enabling data-efficient optimization and robust inverse molecular design, and establish Bayesian optimization as a practical tool for molecular discovery in small-data regimes.
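
To make the target-search idea in the abstract concrete, the following is a minimal sketch of Bayesian optimization over a finite pool of candidates described by low-dimensional vectors. It is not the authors' implementation: the RBF-kernel Gaussian process, the confidence-bound style acquisition rule, the synthetic three-dimensional descriptors, the toy property function, and all names (`rbf_kernel`, `gp_posterior`, `search_for_target`, `kappa`, `eps`) are assumptions made for illustration only.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of descriptor vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(x_train, y_train, x_query, jitter=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP on standardized targets."""
    y_mean, y_std = y_train.mean(), y_train.std() + 1e-12
    z = (y_train - y_mean) / y_std
    k = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, z))
    k_star = rbf_kernel(x_train, x_query)
    mean = k_star.T @ alpha
    v = np.linalg.solve(chol, k_star)
    var = np.clip(1.0 - (v**2).sum(axis=0), 0.0, None)
    return y_mean + y_std * mean, y_std * np.sqrt(var)


def search_for_target(descriptors, evaluate, target, eps, n_init=5, kappa=1.0, seed=0):
    """Evaluate candidates one at a time until a property value lies within eps of target.

    Returns (index of the matching candidate or None, number of evaluations used).
    """
    rng = np.random.default_rng(seed)
    n = len(descriptors)
    evaluated = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    values = [evaluate(i) for i in evaluated]
    while True:
        best = int(np.argmin(np.abs(np.array(values) - target)))
        if abs(values[best] - target) <= eps:
            return evaluated[best], len(evaluated)
        remaining = np.setdiff1d(np.arange(n), evaluated)
        if remaining.size == 0:
            return None, len(evaluated)
        mean, std = gp_posterior(descriptors[evaluated], np.array(values), descriptors[remaining])
        # Assumed acquisition rule: prefer candidates predicted close to the
        # target, with a bonus for uncertain predictions.
        score = np.abs(mean - target) - kappa * std
        pick = int(remaining[np.argmin(score)])
        evaluated.append(pick)
        values.append(evaluate(pick))


# Synthetic stand-in for a descriptor table and a property lookup (hypothetical data).
pool_rng = np.random.default_rng(1)
pool = pool_rng.uniform(size=(200, 3))
prop = 40.0 * pool[:, 0] + 10.0 * np.sin(3.0 * pool[:, 1]) + 5.0 * pool[:, 2] ** 2
target = float(prop[123])

idx, n_evals = search_for_target(pool, lambda i: float(prop[i]), target, eps=0.1)
print(f"matched candidate {idx} after {n_evals} evaluations")
```

Because the pool is finite and each candidate is evaluated at most once, this sketch always terminates; the quantity of interest is how many evaluations it needs compared with the pool size.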

Paper Structure (12 sections, 17 equations, 6 figures, 2 tables, 2 algorithms)

Figures (6)

  • Figure 1: Left: Molecular function of ${\rm H_{3}C_{4}N_{2}OF}$ (blue solid line) computed from \ref{eq:f_molecule}, with the shaded region highlighting the carbon peak. Right: Enhanced view of the carbon peak of $f_{\rm H_{3}C_{4}N_{2}OF}$ overlaid with $f_{Z=6}$ (orange dotted line given by \ref{eq:N_Z}). Parameters: $\omega_{Z} =10$, $\sigma_{Z}^{2} = 0.5$, $\beta = 1.2$, $s = 0.7$ and $\omega = 10$.
  • Figure 2: The carbon peaks in the molecular spectra of ${\rm H_{3}C_{4}N_{2}OF}$ (blue line with circles) and ${\rm H_{9}C_{6}NO_{2}}$ (red line with triangles). The values of inner products between atomic reference probability distribution $f_{Z=6}$ (orange dotted line) and each molecular function $f_{m}$ are displayed. All constants in \ref{eq:N_Z} and \ref{eq:f_molecule} are the same as in \ref{fig:fig2}.
  • Figure 3: Left: Estimated probability distributions $\hat{f}_{\nu, Z=6}$ for $\nu$ carbon atoms. The shaded regions indicate the range of $f_{\rm molecules}$ corresponding to each $\nu$. Results are shown for $\nu = 1$ (red line with squares), $\nu = 2$ (green line with diamonds), $\nu = 3$ (orange line with triangles), and $\nu = 4$ (blue line with circles). Right: Distributions of $\braket{f_6, f_{m}}$ for molecules with various $\nu$ values. For each $\nu$, a Gaussian distribution $h_{\nu, Z}$ with the mean $\mu = \braket{f_{6}, \hat{f}_{\nu, Z=6}}$ and standard deviation $\sigma = 4$ is overlaid.
  • Figure 4: An example application of the inverse chemical formula mapping algorithm with optimized hyperparameters for determining the stoichiometric coefficients from a set of molecular descriptors.
  • Figure 5: Molecular optimization results for entropy (top) and ZPVE (bottom). Target values are sampled uniformly across the full property range. For each target value, ten independent optimizations are performed. The bars represent the average number of BO iterations required to identify the target, and the error bars show the full range of the number of iterations across all trials. The solid green bars indicate 100$\%$ success rate, while the open bars correspond to cases with $<100\%$ success rate. The gray shaded region marks the $0\%$ success rate. For all calculations, the threshold parameter $\epsilon = 0.1\,{\rm kcal\,mol^{-1}}$.
  • ...and 1 more figures
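
The right panel of Figure 3 overlays, for each atom count $\nu$, a Gaussian $h_{\nu, Z}$ with its own mean and a shared standard deviation $\sigma = 4$. One way such per-count Gaussians could be used to read an atom count off an observed inner product is to pick the $\nu$ whose Gaussian assigns the highest density. The sketch below illustrates only that selection step; it is an assumption, not the paper's inverse mapping algorithm, and the mean values and the function names `gaussian_pdf` and `most_likely_count` are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def most_likely_count(inner_product, means, sigma=4.0):
    """Return the atom count nu whose Gaussian gives the observed value the highest density."""
    return max(means, key=lambda nu: gaussian_pdf(inner_product, means[nu], sigma))


# Hypothetical means for nu = 1..4; the real values come from the paper's data.
hypothetical_means = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}
estimate = most_likely_count(23.5, hypothetical_means)
print(estimate)
```

With a shared $\sigma$, this rule reduces to choosing the nearest mean; distinct widths per $\nu$ would make the density comparison matter.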