Enhancing NMR Shielding Predictions of Atoms-in-Molecules Machine Learning Models with Neighborhood-Informed Representations

Surajit Das, Raghunathan Ramakrishnan

TL;DR

The paper tackles accurate, scalable prediction of $^{13}$C NMR shielding using atom-centered ML descriptors. It introduces continuous variants of the atomic Coulomb matrix (aCM-RBF) and atomic bag-of-bonds (aBoB-RBF), augmented with neighbor information, and evaluates them with kernel ridge regression. The key finding is that the aBoB-RBF($4$) descriptor delivers state-of-the-art QM9NMR accuracy of $1.69$ ppm MAE and shows strong transferability to larger drug-like and biomolecular datasets (MAEs around $2.2$–$2.7$ ppm, $R^2\approx0.995$). The work also demonstrates practical deployment via the mlqm9nmr module, and shows that $\Delta$-ML with PM7 baselines can push predictions closer to high-level DFT benchmarks, enabling high-throughput NMR screening with near-DFT accuracy.
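
For orientation, the two learning schemes named above can be written in their generic textbook form. This is a sketch, not the paper's own equations: the kernel function $k$, the regularization strength $\lambda$, and the symbols $\hat{\sigma}$, $\hat{\Delta}$, $\boldsymbol{\alpha}$, and $\mathbf{d}_i$ are notation assumed here, and the summary does not state which kernel or hyperparameters the authors use.

$$
\hat{\sigma}(\mathbf{d}) = \sum_{i=1}^{N} \alpha_i \, k(\mathbf{d}, \mathbf{d}_i), \qquad
\boldsymbol{\alpha} = \left( \mathbf{K} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\sigma}, \qquad
K_{ij} = k(\mathbf{d}_i, \mathbf{d}_j)
$$

$$
\sigma^{\rm target}(\mathbf{d}) \approx \sigma^{\rm PM7}(\mathbf{d}) + \hat{\Delta}(\mathbf{d}), \qquad
\hat{\Delta} \ \text{trained on} \ \Delta_i = \sigma^{\rm target}_i - \sigma^{\rm PM7}_i
$$

In the first line, kernel ridge regression predicts the shielding of a query atom with descriptor $\mathbf{d}$ from $N$ training atoms. In the second, the same regression is fitted to the difference between the high-level target and the PM7 baseline, so the model only has to learn a correction rather than the full shielding.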

Abstract

Accurate prediction of nuclear magnetic resonance (NMR) shielding with machine learning (ML) models remains a central challenge for data-driven spectroscopy. We present atomic variants of the Coulomb matrix (aCM) and bag-of-bonds (aBoB) descriptors, and extend them using radial basis functions (RBFs) to yield smooth, per-atom representations (aCM-RBF, aBoB-RBF). Local structural information is incorporated by augmenting each atomic descriptor with contributions from the $n$ nearest neighbors, resulting in the descriptor families aCM-RBF($n$) and aBoB-RBF($n$). For $^{13}$C shielding prediction on the QM9NMR dataset (831,925 shielding values across 130,831 molecules), aBoB-RBF(4) achieves an out-of-sample mean absolute error (MAE) of 1.69 ppm, outperforming models reported in previous studies. While explicit three-body descriptors further reduce errors at a higher cost, aBoB-RBF(4) offers the best balance of accuracy and efficiency. Benchmarking on external datasets comprising larger molecules (GDBm, Drug12/Drug40, and pyrimidinone derivatives) confirms the robustness and transferability of aBoB-RBF(4), establishing it as a practical tool for ML-based NMR shielding prediction.

Paper Structure

This paper contains 28 sections, 32 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Distribution of $^{13}$C NMR chemical shifts, $\delta(^{13}{\rm C})$, in ppm, in the QM9NMR dataset and its subset with 5k molecules partitioned according to hybridization state of the carbon atom. The table in the inset presents '#': the number of atoms, while '$\mu$' and '$s$' denote the mean value and the standard deviation of the distribution.
  • Figure 2: Illustration of various descriptors explored in this study. (a) aCM: Atomic version of the Coulomb matrix, $M_{IJ}$, where the first index is the query atom, $I$, followed by the neighboring atoms sorted in increasing distance. $M_{IJ}$ are calculated using Eq. \ref{eq:cm_mij} and the upper triangular matrix is vectorized. (b) aBoB: aCM elements bagged as pairwise combinations of atom types ($A$ and $B$), $M_{IJ}^{(A,B)}$, where $I \in A$ and $J \in B$ as defined in Eq. \ref{eq:abob_mij}. Zeros are appended to each bag to ensure constant descriptor sizes across molecules. These bags are then concatenated in a particular sequence to construct the descriptor. (c) aCM-RBF: Continuous version of aCM calculated using Eq. \ref{eq:con_acm}. For a query atom, $I$, aCM is multiplied by a radial basis function (RBF) and summed over all the neighboring atoms $J$ to obtain the continuous function, ${\bf d}(r)$. (d) aBoB-RBF: Continuous version of aBoB calculated using Eq. \ref{eq:con_abob_mab}, resulting in radial functions for each combination of atom types, ${\bf d}^{(A,B)}(r)$.
  • Figure 3: Encoding of neighborhood information in atomic descriptors illustrated for the carbonyl carbon (C2) of acetaldehyde. The query atom's descriptor is ${\bf d}(0)$, which is concatenated with the descriptors of the neighboring atoms after damping with the $f_{\rm cos}(d_k; 2.0)$ function.
  • Figure 4: a) Learning curves on the log-log scale showing mean absolute errors (MAEs in ppm) for predicting $^{13}$C-NMR $\sigma_{\rm iso}$ values with increasing training set sizes. MAEs are calculated for 50k out-of-sample atoms in the QM9NMR dataset. All descriptors have information of the four nearest neighbor atoms (i.e., ${\bf d}(4)$). b) MAEs for a train:test split of 100k:50k for various descriptors with information of $n$ nearest atoms encoded. Previously reported values of 1.88 (FCHL, Ref. gupta2021revving) and 1.87 (MACE-OFF23-small, Ref. shiota2024universal) are shown as blue and red horizontal lines. Values for the descriptors studied in this work are shown for various $n$: aCM (maroon), aBoB (violet), aCM-RBF (orange), aBoB-RBF (skyblue), and aSLATM (green).
  • Figure 5: Analysis of errors of ML models predicting $^{13}$C chemical shifts, $\delta(^{13}{\rm C})$, for pyrimidinone and GDBm validation sets. a) Scatterplot of ML-predicted vs. DFT-calculated $\delta(^{13}{\rm C})$ values for 208 pyrimidinone molecules. Results are shown for sp$^3$ and sp$^2$ C atoms. The density plot shows the distribution of the DFT results. Extreme values with larger MAEs are highlighted as A--E, and these atoms are shown in the corresponding molecules. b) Scatterplot of ML-predicted vs. DFT-calculated $\delta(^{13}{\rm C})$ values for 200 GDBm molecules along with the distribution of DFT values. MAEs (in ppm) for each value of $m$ are shown along with the coefficient of determination, $R^2$, as a bar plot.
  • ...and 2 more figures
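
The captions of Figures 2(c) and 3 describe two steps: smearing Coulomb-matrix-like pair terms onto a radial grid with an RBF, and concatenating the query atom's descriptor with damped descriptors of its nearest neighbors. The Python sketch below illustrates that idea only. The paper's equations are not reproduced in this summary, so the pair weight $Z_I Z_J / R_{IJ}$, the Gaussian RBF and its width `sigma`, the cosine cutoff form with a cutoff of 2.0 (read from $f_{\rm cos}(d_k; 2.0)$ and assumed to be in Å), the zero padding for missing neighbors, and all function names and the toy geometry are assumptions made for illustration.

```python
import math

def radial_descriptor(query, atoms, grid, sigma=0.2):
    """Smear assumed pair terms Z_I*Z_J/R_IJ of one query atom onto a radial grid."""
    z_i, pos_i = atoms[query]
    d = [0.0] * len(grid)
    for j, (z_j, pos_j) in enumerate(atoms):
        if j == query:
            continue
        r_ij = math.dist(pos_i, pos_j)
        weight = z_i * z_j / r_ij  # off-diagonal Coulomb-matrix form (assumed)
        for k, r in enumerate(grid):
            # Gaussian RBF centered at the interatomic distance (assumed form)
            d[k] += weight * math.exp(-((r - r_ij) ** 2) / (2.0 * sigma ** 2))
    return d

def cosine_damping(dist, cutoff=2.0):
    """Standard cosine cutoff: 1 at dist=0, 0 at and beyond the cutoff (assumed form)."""
    if dist >= cutoff:
        return 0.0
    return 0.5 * (math.cos(math.pi * dist / cutoff) + 1.0)

def neighborhood_descriptor(query, atoms, grid, n=4, sigma=0.2):
    """Concatenate the query descriptor with damped descriptors of n nearest atoms."""
    pos_i = atoms[query][1]
    # Neighbors sorted by increasing distance from the query atom
    others = sorted(
        (math.dist(pos_i, pos_j), j)
        for j, (_, pos_j) in enumerate(atoms)
        if j != query
    )
    blocks = [radial_descriptor(query, atoms, grid, sigma)]
    for k in range(n):
        if k < len(others):
            dist, j = others[k]
            f = cosine_damping(dist)
            blocks.append([f * v for v in radial_descriptor(j, atoms, grid, sigma)])
        else:
            # Zero padding keeps the length fixed (assumption, not from the paper)
            blocks.append([0.0] * len(grid))
    return [v for block in blocks for v in block]

# Hypothetical three-atom fragment: (nuclear charge, xyz position in Angstrom)
atoms = [
    (6, (0.0, 0.0, 0.0)),   # C, the query atom
    (8, (1.2, 0.0, 0.0)),   # O
    (1, (-0.6, 0.9, 0.0)),  # H
]
grid = [0.5 + 0.1 * k for k in range(30)]  # radial grid with 30 points
desc = neighborhood_descriptor(0, atoms, grid, n=4)
print(len(desc))  # (n + 1) * len(grid) entries
```

With this layout every atom gets a vector of fixed length `(n + 1) * len(grid)`, so descriptors from molecules of different sizes can be compared directly by a kernel. An aBoB-RBF-style variant would keep one radial function per pair of atom types instead of a single summed function.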