Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents

Nicholas J. Williams, Lara Kabalan, Ljiljana Stojanovic, Viktor Zolyomi, Edward O. Pyzer-Knapp

TL;DR

Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, covering 41,645 molecules from the QM9 dataset at the ωB97x/6-31G* level. Incorporating second derivatives of the potential energy surface into the loss function of an MLIP significantly improves the prediction of vibrational frequencies in all solvent environments.

Abstract

A significant challenge in computational chemistry is developing approximations that accelerate ab initio methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising solution for constructing atomistic potentials that can be transferred across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces in vacuum, although including the curvature of the potential energy surface could improve its description. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the ωB97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model. To demonstrate the utility of this dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of an MLIP significantly improves the prediction of vibrational frequencies in all solvent environments. This makes the dataset well suited to studying organic molecules in realistic solvent environments for experimental characterization.
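
The idea of adding second derivatives to the loss can be illustrated with a toy one-dimensional sketch. At an equilibrium configuration the forces vanish, so energy and force errors alone cannot distinguish a model with the wrong curvature from the correct one, whereas a Hessian term can. Everything below (the harmonic potentials, the weights `w_e`, `w_f`, `w_h`, and the function names) is a hypothetical illustration, not the model, loss, or weighting used in the paper.

```python
# Toy 1D sketch: a loss with energy, force, and curvature (Hessian) terms.
# The potentials, weights, and function names are illustrative assumptions,
# not the settings or model used in the paper.

def first_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of d2f/dx2 at x (a 1D 'Hessian')."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def derivative_loss(model, reference, points, w_e=1.0, w_f=1.0, w_h=0.0):
    """Mean weighted squared error on energy, force, and curvature."""
    total = 0.0
    for x in points:
        energy_err = model(x) - reference(x)
        # Force is the negative gradient of the energy.
        force_err = -first_derivative(model, x) + first_derivative(reference, x)
        hess_err = second_derivative(model, x) - second_derivative(reference, x)
        total += w_e * energy_err**2 + w_f * force_err**2 + w_h * hess_err**2
    return total / len(points)


def reference(x):
    """Reference harmonic well with force constant 4 (arbitrary units)."""
    return 0.5 * 4.0 * x**2


def soft_model(x):
    """Hypothetical model with the right minimum but the wrong curvature."""
    return 0.5 * 1.0 * x**2


equilibrium = [0.0]
# Energy and force terms only: blind to the curvature error at the minimum.
loss_ef = derivative_loss(soft_model, reference, equilibrium)
# Adding the Hessian term exposes the mismatch in force constants.
loss_efh = derivative_loss(soft_model, reference, equilibrium, w_h=1.0)
print(loss_ef, loss_efh)
```

In this sketch the energy-and-force loss is zero at the minimum, while the loss with the Hessian term is close to 9, the squared difference between the two force constants.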

Paper Structure (12 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: a, illustration of the global scalar (energy) of a graph (molecule), left, and the vectors (forces) on the nodes (atoms), right. Below is an illustration of the second-order derivative of the potential energy surface with respect to the positions of two nodes (atoms), which gives a 3x3 Hessian matrix. b, fitting a model to a one-dimensional function (black dashed line): the top plot uses the function value and its gradient to train the model (filled blue line), while the bottom plot also includes the Hessian (filled orange line), demonstrating an improved approximation with higher-order derivatives. c, mean absolute error (MAE) between the predicted and calculated vibrational frequencies for an MLIP trained on energy and forces (blue) and an MLIP fine-tuned on Hessian data. The box plot on the left shows the MAE for the full spectrum between $\mathrm{4000-400 \ cm^{-1}}$; the subsequent box plots cover characteristic wavenumber ranges: $\mathrm{3600-2800 \ cm^{-1}}$ corresponds to stretching vibrations of C-H, O-H and N-H bonds; $\mathrm{2000-1500 \ cm^{-1}}$ corresponds to stretching vibrations of the shorter double bonds C=C and C=O; $\mathrm{1500-400 \ cm^{-1}}$ is the fingerprint region, which contains a complex pattern of absorption bands specific to each molecule; and $\mathrm{400-10 \ cm^{-1}}$ corresponds to longer-range molecular bending and torsional motions.
  • Figure 2: a, UMAP density plot of the QM9 dataset using SOAP descriptors, where contour lines show the density of the whole dataset and the black dots show the 40,000 data points selected by farthest point sampling. b-e, histograms of the energy differences (in eV) from a linear regression of atomic energies for the sampled ground-state configurations in different environments. The energy differences are calculated as deviations from a reference model in which the atomic energies of hydrogen, carbon, nitrogen, and oxygen are approximately $-16.67$, $-1035.81$, $-1489.65$, and $-2047.05$ eV, respectively. In vacuum, b, the mean energy difference is $-0.05$ eV and the standard deviation is 1.19 eV; in THF, c, the mean is $-0.44$ eV and the standard deviation is 1.17 eV; in toluene, d, the mean is $-0.37$ eV and the standard deviation is 1.19 eV; and in water, e, the mean is $-0.45$ eV and the standard deviation is 1.11 eV.
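
The farthest point sampling mentioned in the Figure 2 caption can be sketched as a greedy selection over descriptor vectors. This is a minimal sketch under stated assumptions: the two-dimensional points stand in for SOAP descriptors, and the Euclidean metric, the starting index, and the function name `farthest_point_sampling` are illustrative choices that the source does not specify.

```python
import math


def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    `points` is a sequence of equal-length coordinate tuples (placeholders for
    descriptor vectors). Returns the indices of the selected points in order.
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_samples:
        # The next sample is the point least covered by the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical toy data: a tight cluster near the origin plus two outliers.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (10.0, 0.0), (5.0, 0.0)]
chosen = farthest_point_sampling(toy_points, n_samples=3)
print(chosen)  # [0, 3, 4]
```

On this toy data the selection skips the near-duplicate points in the cluster and spreads across the space, which is the behaviour that makes the method attractive for picking a diverse subset of a large dataset.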