Enhancing NMR Shielding Predictions of Atoms-in-Molecules Machine Learning Models with Neighborhood-Informed Representations
Surajit Das, Raghunathan Ramakrishnan
TL;DR
The paper tackles accurate, scalable prediction of $^{13}$C NMR shielding using atom-centered ML descriptors. It introduces continuous variants of the atomic Coulomb matrix (aCM-RBF) and atomic bag-of-bonds (aBoB-RBF), augmented with neighbor information, and evaluates them with kernel ridge regression. The key finding is that the aBoB-RBF($4$) descriptor delivers state-of-the-art QM9NMR accuracy of $1.69$ ppm MAE and shows strong transferability to larger drug-like and biomolecular datasets (MAEs around $2.2$–$2.7$ ppm, $R^2\approx0.995$). The work also demonstrates practical deployment via the mlqm9nmr module, and shows that $\Delta$-ML with PM7 baselines can push predictions closer to high-level DFT benchmarks, enabling high-throughput NMR screening with near-DFT accuracy.
Abstract
Accurate prediction of nuclear magnetic resonance (NMR) shielding with machine learning (ML) models remains a central challenge for data-driven spectroscopy. We present atomic variants of the Coulomb matrix (aCM) and bag-of-bonds (aBoB) descriptors, and extend them using radial basis functions (RBFs) to yield smooth, per-atom representations (aCM-RBF, aBoB-RBF). Local structural information is incorporated by augmenting each atomic descriptor with contributions from the $n$ nearest neighbors, yielding the descriptor families aCM-RBF($n$) and aBoB-RBF($n$). For $^{13}$C shielding prediction on the QM9NMR dataset (831,925 shielding values across 130,831 molecules), aBoB-RBF(4) achieves an out-of-sample mean absolute error (MAE) of 1.69 ppm, outperforming models reported in previous studies. While explicit three-body descriptors further reduce errors at a higher computational cost, aBoB-RBF(4) offers the best balance of accuracy and efficiency. Benchmarking on external datasets comprising larger molecules (GDBm, Drug12/Drug40, and pyrimidinone derivatives) confirms the robustness and transferability of aBoB-RBF(4), establishing it as a practical tool for ML-based NMR shielding prediction.
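The neighbor-augmentation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' definition of aCM-RBF($n$) or aBoB-RBF($n$), which the abstract does not spell out. It assumes a toy per-atom descriptor in which each interatomic distance is smeared by a Gaussian on a radial grid and weighted by the familiar Coulomb-matrix off-diagonal term $Z_i Z_j / r_{ij}$, and it assumes that "augmenting with the $n$ nearest neighbors" means concatenating the neighbors' own descriptors in order of distance. The function names, grid, Gaussian width, zero-padding rule, and the toy geometry are all hypothetical.

```python
import numpy as np

def atomic_rbf_descriptor(Z, R, center, grid, width=0.2):
    """Toy smooth descriptor for one atom (illustrative, not the paper's formula).

    Each distance from `center` to another atom contributes a Gaussian on
    `grid`, weighted by the Coulomb-like term Z_center * Z_j / r.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R - R[center], axis=1)
    others = np.arange(len(Z)) != center          # exclude the atom itself
    weights = Z[center] * Z[others] / dist[others]
    gaussians = np.exp(-((grid[None, :] - dist[others, None]) ** 2)
                       / (2.0 * width ** 2))
    return weights @ gaussians                    # shape: (len(grid),)

def neighbor_augmented_descriptor(Z, R, center, n, grid, width=0.2):
    """Concatenate the center's descriptor with those of its n nearest atoms.

    Assumption: neighbors are ranked by distance to `center`; molecules with
    fewer than n other atoms are zero-padded to a fixed length.
    """
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R - R[center], axis=1)
    order = np.argsort(dist, kind="stable")
    neighbors = [int(i) for i in order if i != center][:n]
    blocks = [atomic_rbf_descriptor(Z, R, i, grid, width)
              for i in [center, *neighbors]]
    blocks += [np.zeros_like(grid)] * (n + 1 - len(blocks))
    return np.concatenate(blocks)

# Made-up, roughly methanol-like geometry in angstrom (C, O, 4 x H).
Z = [6, 8, 1, 1, 1, 1]
R = np.array([
    [0.00, 0.00, 0.00],
    [1.43, 0.00, 0.00],
    [-0.36, 1.03, 0.00],
    [-0.36, -0.51, 0.89],
    [-0.38, -0.50, -0.90],
    [1.75, 0.90, 0.00],
])
grid = np.linspace(0.5, 5.0, 32)

x = neighbor_augmented_descriptor(Z, R, center=0, n=4, grid=grid)
x_shifted = neighbor_augmented_descriptor(Z, R + np.array([1.0, -2.0, 0.5]),
                                          center=0, n=4, grid=grid)
```

Because the sketch depends only on interatomic distances, `x` and `x_shifted` agree up to floating-point error, and the vector length is fixed at `(n + 1) * len(grid)`, which is what a kernel method such as kernel ridge regression needs in order to compare atoms from different molecules.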
