Pushing the limits of one-dimensional NMR spectroscopy for automated structure elucidation using artificial intelligence

Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland

TL;DR

The paper addresses de novo structure elucidation from 1D NMR data with an end-to-end transformer framework that predicts molecular structures directly from 1H and/or 13C spectra, without relying on a molecular formula or other context. A transformer is first pretrained to reconstruct structures from Morgan fingerprints (the substructure-to-structure task), and its weights initialize a multitask model that predicts both structures and substructures from spectra. Reported results are 97.8% top-15 accuracy for reconstructing molecules with up to 40 heavy atoms from fingerprints; 55.2% top-15 structure accuracy for direct spectrum-to-structure prediction, compared with 46.6% using 1H alone and 5.5% using 13C alone; and an F1 score of about 0.84 for substructure prediction. When fine-tuned on limited experimental data, the approach reaches 19.9% structure accuracy on 50 spectra. The code is open-sourced to support adoption and integration with CASE (computer-assisted structure elucidation) workflows. Remaining gaps include stereochemistry and transfer between simulated and experimental spectra.
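The top-15 figures above count a prediction as correct when the target structure appears anywhere among the first 15 ranked candidates. A minimal sketch of that metric follows, assuming each structure is represented as a canonical SMILES string so that string equality stands in for structure identity. The function name `top_k_accuracy` and the toy data are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 15,
) -> float:
    """Fraction of targets that appear among the first k ranked candidates.

    Assumption: candidates and targets are canonical SMILES strings, so a
    plain string comparison is enough to decide whether a structure matches.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        return 0.0
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy example: three targets, two ranked candidates each (hypothetical data).
example_predictions = [["CCO", "CO"], ["CC", "C"], ["CCN", "CN"]]
example_targets = ["CO", "CCC", "CCN"]
top_1 = top_k_accuracy(example_predictions, example_targets, k=1)
top_2 = top_k_accuracy(example_predictions, example_targets, k=2)
```

In this toy set only the third target is ranked first, while the first target is recovered at rank two, so widening `k` from 1 to 2 raises the score.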

Abstract

One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e. de novo structure generation, thus appears completely intractable. Here we show how it is possible to achieve this task for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 55.2% accuracy within the first 15 predictions using only the $^1$H and $^{13}$C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.

Paper Structure

This paper contains 5 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of our structure elucidation framework consisting of (A): the multitask spectrum-to-structure/spectrum-to-substructure model that generates both structure and substructure predictions and (B): the substructure-to-structure pretraining approach that reconstructs SMILES strings from Morgan fingerprints. Weights from a transformer pretrained on the substructure-to-structure task are used to initialize the multitask model, as indicated by the arrow connecting the two and the different coloration of the encoder-decoder component of the multitask framework. Specific details regarding the transformer model architecture and multitask model architecture can be found in SI Section 1.
  • Figure 2: Results for the substructure-to-structure task. (A) Fraction of incorrect predictions and the average maximum Tanimoto similarity (MTS) of incorrect predictions to the target molecule as a function of the number of heavy atoms. The dashed lines are the simple moving averages for each quantity across sizes. (B) The distribution of the MTS to the target molecule with decompositions across different ranges of numbers of heavy atoms. (C) Examples of molecules that were predicted correctly by the substructure-to-structure transformer, all with 40 heavy atoms and varied elemental compositions.
  • Figure 3: Results for the spectrum-to-structure task. (A) Fraction of incorrect predictions and the average maximum Tanimoto similarity (MTS) of incorrect predictions to the target molecule as a function of the number of heavy atoms. The dashed lines are the simple moving averages for each quantity across sizes. (B) The distribution of the MTS to the target molecule with decompositions across different ranges of numbers of heavy atoms. (C) Examples of molecules that were predicted correctly by the multitask model alongside their 1H NMR spectra. All systems shown range from 35 to 40 heavy atoms.
  • Figure 4: (Top) Multitask model test set structure prediction accuracy resolved by element, where accuracy is the percentage of systems containing that element that the model predicted correctly. (Bottom) Test set substructure prediction accuracy resolved by element, where the accuracy is expressed as the F1 score and the substructure set fraction is the fraction of substructures in the entire substructure set that contain the specified element.
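The maximum Tanimoto similarity (MTS) reported in Figures 2 and 3 can be illustrated with a short sketch. It assumes each fingerprint is given as the set of its on-bit indices, and it takes the MTS of a prediction list to be the highest Tanimoto similarity between any candidate and the target. The function names and the empty-set convention are choices made here for illustration, not details taken from the paper.

```python
from typing import AbstractSet, Iterable


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto similarity |a & b| / |a | b| between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def max_tanimoto_similarity(
    candidates: Iterable[AbstractSet[int]],
    target: AbstractSet[int],
) -> float:
    """Highest Tanimoto similarity between any candidate and the target.

    Returns 0.0 when there are no candidates (an assumption of this sketch).
    """
    return max((tanimoto(c, target) for c in candidates), default=0.0)


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
target_bits = {2, 3, 4}
candidate_bits = [{1, 2}, {2, 3, 4, 5}]
mts = max_tanimoto_similarity(candidate_bits, target_bits)
```

Here the second candidate shares three of four distinct bits with the target, so it determines the MTS. An incorrect prediction with a high MTS is structurally close to the target even though it is not an exact match, which is what the MTS curves in the figures convey.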