Pushing the limits of one-dimensional NMR spectroscopy for automated structure elucidation using artificial intelligence
Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland
TL;DR
The paper tackles de novo structure elucidation from 1D NMR data by introducing an end-to-end transformer framework that predicts molecular structures directly from 1H/13C spectra without relying on molecular formulas or extra context. The model is first pretrained on Morgan fingerprints to learn substructure-to-structure mappings, and this pretrained component is then integrated into a multitask model that also predicts substructures from spectra. Key findings include 97.8% top-15 accuracy for reconstructing molecules of up to 40 heavy atoms from fingerprints, and 55.2% top-15 structure accuracy for direct spectrum-to-structure prediction, with 46.6% using 1H alone and 5.5% using 13C alone; substructure prediction is accurate (F1 ≈ 0.84). The approach remains effective when fine-tuned on limited experimental data (19.9% structure accuracy on 50 spectra). The code is open-sourced to support broader adoption and integration with CASE workflows. Together, these results mark a significant step toward automated, data-efficient structure elucidation from routine NMR data, while gaps remain in stereochemistry and in transfer between simulated and experimental spectra.
Abstract
One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e., de novo structure generation, thus appears completely intractable. Here we show that this task can be achieved for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 55.2% accuracy within the first 15 predictions using only the $^1$H and $^{13}$C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.
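The headline figure above is a top-15 accuracy: a prediction counts as correct when the true structure appears anywhere among the first 15 ranked candidates. The following is a minimal sketch of how such a metric could be computed, not the authors' evaluation code. The function name `top_k_accuracy` and the toy data are hypothetical, and the strings are assumed to be already canonicalized structure identifiers so that plain string equality is a valid match test.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 15,
) -> float:
    """Fraction of targets found among the first k ranked candidates.

    Assumption: every string is a canonicalized structure identifier,
    so string equality is enough to decide whether two structures match.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need one candidate list per target")
    if not targets:
        raise ValueError("need at least one target")
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy data (hypothetical): each inner list is ranked best-first.
predictions = [
    ["CCO", "COC"],   # correct structure ranked first
    ["CC=O", "CCO"],  # correct structure ranked second
    ["CCN", "CNC"],   # correct structure not among the candidates
]
targets = ["CCO", "CCO", "CCC"]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

Widening the cutoff can only keep or raise the score, which is why top-15 figures are higher than top-1 figures for the same model.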
