Thermodynamic Descriptors from Molecular Dynamics as Machine Learning Features for Extrapolable Property Prediction

Nuria H. Espejo, Pablo Llombart, Andrés González de Castilla, Jorge Ramirez, Jorge R. Espinosa, Adiran Garaizar

Abstract

Machine learning (ML) models that rely on molecular structure excel at predicting properties for well-represented organic compounds; however, their limited ability to extrapolate to chemotypes outside their training domain remains a critical bottleneck in chemical discovery. This challenge is particularly acute in industrial discovery, where navigating uncharted chemical space to generate new intellectual property is a primary objective. Normal boiling points serve as a key benchmark for testing the extrapolative power of ML algorithms. A major limitation is that group-contribution methods are by design unable to generate predictions for molecules containing unparameterized fragments. Here, we demonstrate that this limitation can be overcome by replacing structural descriptors with thermodynamic properties computed directly from molecular dynamics simulations. We introduce a physics-augmented framework where a CatBoost regression model learns directly from ensemble-averaged cohesive energies, heats of vaporization, and densities extracted from atomistic liquid-phase simulations. Benchmark comparisons reveal that while both our physics-augmented model and conventional structure-based models perform comparably well on standard organic compounds, only the former maintains controlled error growth when extrapolating to structurally dissimilar chemical space. Our model successfully predicts boiling points for chemical classes entirely absent from training -- including inorganic compounds, salts, and molecules with elements like Si, B, and Te -- where structure-based models are fundamentally inapplicable. By encoding the intermolecular forces governing phase behavior, our framework establishes a generalizable strategy for property prediction beyond the structural boundaries of existing methods.

Paper Structure (17 sections, 6 figures)

Figures (6)

  • Figure 1: Schematic of the physics-augmented workflow for normal boiling point prediction. Molecular structures are built from SMILES strings and subjected to short all-atom molecular dynamics simulations using the OPLS and OpenFF force fields. Thermodynamic properties, including cohesive energy and heat of vaporization, are extracted from the simulations and used as physics-augmented descriptors to train a CatBoost regression model for the final boiling point prediction.
  • Figure 2: Correlation between simulation-derived cohesive energy ($E_{\text{coh}}$) and experimental boiling point ($T_b$) for 1,280 organic compounds simulated at three temperatures using two independent force fields. (a) OpenFF-2.0.0 [wagner2021openforcefield] results showing linear relationships at 300 K (blue), 400 K (green), and 500 K (pink) with corresponding $R^2$ values of 0.73, 0.76, and 0.73, respectively. (b) OPLS4 [lu2021opls4] results at the same temperatures, yielding $R^2$ values of 0.77, 0.81, and 0.82. Each data point represents ensemble-averaged intermolecular interaction energies extracted from 20 ns NPT simulations. Data points at near-zero cohesive energy (predominantly at 500 K) correspond to compounds that underwent a phase transition to the gas phase during simulation and therefore exhibit negligible intermolecular interactions due to substantial box expansion and reduced density. These points are retained intentionally so that the ML model can learn when MD features are meaningful. Linear regression lines are fitted to the full dataset at each temperature, encompassing both liquid-phase and transitioned gas-phase systems.
  • Figure 3: Performance comparison of machine learning models trained on different descriptor sets for boiling point prediction. (a) Cross-validated predicted versus experimental boiling points for the MD-only models. The plot shows the out-of-sample predictions on the 1,280-compound training set, obtained from the 4-fold cross-validation procedure described in the Methods. The models were trained exclusively on thermodynamic descriptors from simulations with the OPLS4 (purple) and OpenFF-2.0.0 (yellow) force fields. The solid black line represents a perfect prediction. (b) Cross-validated prediction errors across multiple model architectures. Bars represent mean absolute error (MAE) and root mean squared error (RMSE) for: MD-only models (dark blue), chemoinformatics-only models trained on molecular fingerprints and 2D descriptors (teal), hybrid models combining both descriptor types (dark green), and a literature baseline using Random Forest on the original uncurated dataset (light green) [kim2024integrating]. Solid bars correspond to OPLS4-derived features, while hatched bars show results for OpenFF-2.0.0 features. (c) Normalized feature importance (in %) for the OPLS4-based model architectures. The horizontal bars show the relative contribution of the most significant features. Blue segments represent MD-derived features, red segments represent chemoinformatics descriptors, and gray represents the cumulative importance of other minor structural features. Top bar (MD-only, 3 features selected): Importance is concentrated in thermodynamic features like the heat of vaporization ($\Delta H_{\text{vap}}$). Middle bar (Chemoinformatics-only, $>$1000 features used): Importance is led by molecular weight, followed by the information content of the characteristic polynomial (Ipc) and van der Waals surface area (VSA). Bottom bar (Hybrid, $>$1000 features used): A synergistic model where thermodynamic features are complemented by key structural descriptors.
  • Figure 4: Extrapolative performance of MD-based models on structurally novel and complex chemical systems. (a) Mean Absolute Error (MAE) on a curated test set of 32 complex organic molecules for various models: GRAPPA (pink), Rarey-Nanoolal (yellow), MD-only (dark blue), hybrid MD+chemoinformatics (dark green), chemoinformatics-only (light green), and the Joback method (light blue). The average Tanimoto similarity of the test set to the respective training sets is noted for GRAPPA (0.82) and our models (0.38), highlighting the greater novelty of the test set for our approach. (b) MAE comparison between the MD-only model (dark blue) and GRAPPA (pink), stratified by Tanimoto similarity between test compounds and each model's respective training set. The MD-only model demonstrates consistently lower error, with its advantage growing significantly as structural similarity decreases. Each bin contains at least 3 data points. (c) Predicted versus experimental boiling points for chemical systems outside the applicability domain of most existing predictors. The model successfully predicts boiling points for neutral compounds with uncommon elements like Si, B, and Te (pink circles, e.g., tribromosilane) and charged systems including salts and ionic liquids (blue stars, e.g., Acesulfame K and an IL-methanol mixture).
  • Figure S1: Normalized feature importance (in %) for the OpenFF-based model architectures. The horizontal bars show the relative contribution of the most significant features. Blue segments represent MD-derived features, red segments represent chemoinformatics descriptors, and gray represents the cumulative importance of other minor structural features. Top bar (MD-only, 3 features selected): Importance is concentrated in thermodynamic features like the heat of vaporization ($\Delta H_{\text{vap}}$). Middle bar (Chemoinformatics-only, $>$1000 features used): Importance is led by molecular weight, followed by the information content of the characteristic polynomial (Ipc) and Topological Polar Surface Area (TPSA). Bottom bar (Hybrid, $>$1000 features used): A synergistic model where thermodynamic features are complemented by key structural descriptors.
  • ...and 1 more figure
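
The similarity-stratified comparison in Figure 4(b) can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: fingerprints are represented as sets of on-bit indices, each test compound is scored by its maximum Tanimoto similarity to any training compound (the caption does not say how per-compound similarity is aggregated), and the bin edges and toy data are hypothetical.

```python
from collections import defaultdict


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity_to_training(query, training):
    """Assumed aggregation: similarity of a query to its nearest training compound."""
    return max(tanimoto(query, t) for t in training)


def mae_by_similarity_bin(similarities, abs_errors, edges):
    """Average absolute errors within similarity bins [edges[i], edges[i+1])."""
    grouped = defaultdict(list)
    for s, err in zip(similarities, abs_errors):
        for lo, hi in zip(edges[:-1], edges[1:]):
            # The last bin is closed on the right so that s == edges[-1] is counted.
            if lo <= s < hi or (hi == edges[-1] and s == hi):
                grouped[(lo, hi)].append(err)
                break
    return {bin_: sum(v) / len(v) for bin_, v in grouped.items()}


# Hypothetical toy fingerprints and absolute boiling point errors (in K).
training = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
test = [frozenset({1, 2, 3}), frozenset({7, 8}), frozenset({2, 3, 5, 6})]
abs_errors = [4.0, 20.0, 6.0]

sims = [max_similarity_to_training(fp, training) for fp in test]
binned = mae_by_similarity_bin(sims, abs_errors, edges=[0.0, 0.5, 1.0])
print(sims)    # [0.75, 0.0, 0.75]
print(binned)  # {(0.0, 0.5): 20.0, (0.5, 1.0): 5.0}
```

In practice the fingerprints would come from a cheminformatics toolkit, and a minimum count per bin (the caption reports at least 3 data points) would be enforced before a bin average is reported.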