Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra

Ziyu Xiong, Yichi Zhang, Foyez Alauddin, Chu Xin Cheng, Joon Soo An, Mohammad R. Seyedsayamdost, Ellen D. Zhong

TL;DR

This work tackles de novo small-molecule structure elucidation from 1D NMR spectra by introducing ChefNMR, a conditional diffusion pipeline that encodes spectra with a hybrid NMR-ConvFormer and generates 3D coordinates via a Diffusion Transformer conditioned on the chemical formula. A large synthetic dataset, SpectraNP, enables training on complex natural products, and ChefNMR achieves state-of-the-art performance on synthetic benchmarks with notable zero-shot generalization to experimental spectra. The approach combines strong spectral embeddings, diffusion-based 3D generation, and classifier-free guidance to robustly infer molecular structures, suggesting substantial potential to accelerate natural product discovery. Limitations include domain shift between synthetic and real spectra and the need for additional spectral modalities and confidence estimation to support real-world deployment.
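As a minimal sketch of the classifier-free guidance step mentioned above: at each denoising step the model is evaluated once with the spectral condition and once without it, and the two outputs are blended. The function name, the `guidance_scale` parameter, and the convention that a scale of 1 recovers the purely conditional prediction are assumptions for illustration; this summary does not state the exact formulation ChefNMR uses.

```python
from typing import List, Sequence


def classifier_free_guidance(
    cond: Sequence[float],
    uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend conditional and unconditional denoiser outputs.

    Assumed convention (hypothetical, not taken from the paper):
        guided = uncond + guidance_scale * (cond - uncond)
    so scale 0 gives the unconditional output and scale 1 the conditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy flattened coordinate predictions for a single atom (made-up numbers).
cond_pred = [1.0, 2.0, 3.0]
uncond_pred = [0.0, 0.0, 0.0]
guided = classifier_free_guidance(cond_pred, uncond_pred, guidance_scale=2.0)
```

With a scale above 1, the blended output is pushed further in the direction that the spectral condition suggests, which is the usual motivation for this kind of guidance.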

Abstract

Nuclear Magnetic Resonance (NMR) spectroscopy is a cornerstone technique for determining the structures of small molecules and is especially critical in the discovery of novel natural products and clinical therapeutics. Yet, interpreting NMR spectra remains a time-consuming, manual process requiring extensive domain expertise. We introduce ChefNMR (CHemical Elucidation From NMR), an end-to-end framework that directly predicts an unknown molecule's structure solely from its 1D NMR spectra and chemical formula. We frame structure elucidation as conditional generation from an atomic diffusion model built on a non-equivariant transformer architecture. To model the complex chemical groups found in natural products, we generated a dataset of simulated 1D NMR spectra for over 111,000 natural products. ChefNMR predicts the structures of challenging natural product compounds with an unsurpassed accuracy of over 65%. This work takes a significant step toward solving the grand challenge of automating small-molecule structure elucidation and highlights the potential of deep learning in accelerating molecular discovery. Code is available at https://github.com/ml-struct-bio/chefnmr.
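Because generation is conditioned on the chemical formula, the set of atoms to place is known before sampling starts. The sketch below shows one way a formula could be expanded into a per-atom list of element types. The helper name `formula_to_atom_types` is hypothetical, and the parser assumes a simple formula without parentheses, charges, or isotope labels; it is not taken from the ChefNMR code.

```python
import re
from typing import List

# One element symbol (e.g. "C", "Cl") followed by an optional count.
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def formula_to_atom_types(formula: str) -> List[str]:
    """Expand a simple chemical formula into one element symbol per atom."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    atoms: List[str] = []
    for symbol, count in _ELEMENT.findall(formula):
        # A missing count means a single atom of that element.
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms


# Ethanol, C2H6O: two carbons, six hydrogens, one oxygen.
ethanol_atoms = formula_to_atom_types("C2H6O")
```

The length of the resulting list fixes how many atom tokens the generative model has to produce coordinates for.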

Paper Structure

This paper contains 36 sections, 15 equations, 15 figures, 21 tables, 5 algorithms.

Figures (15)

  • Figure 1: Natural products are small molecules secreted by natural sources such as plants, animals, and microorganisms (left). To identify an unknown molecule's structure, 1D NMR spectroscopy measures peaks corresponding to each proton ($^1$H) or carbon (${}^{13}$C) atom (middle). The resulting chemical shifts (x-axis locations), peak intensities, and J-coupling (splitting patterns) encode information on chemical groups and connectivities, from which the molecular structure can be deduced (right).
  • Figure 1: Summary of dataset statistics.
  • Figure 2: Overview of the ChefNMR architecture. (a) NMR-ConvFormer processes 1D NMR spectra into a vector embedding using the convolutional tokenizer, transformer encoder, and multihead attention pooling (MAP). (b) Diffusion Transformer predicts clean 3D coordinates $\hat{\bm{X}}_0$ from atom tokens formed by concatenating noisy coordinates $\bm{X}_\sigma$ and atom types $\bm{A}$, conditioned on the spectral embedding and noise level $\sigma$ via adaptive layer normalization (Peebles and Xie, 2023).
  • Figure 3: Examples of ChefNMR's predictions on the synthetic SpectraNP dataset. (a) Correctly predicted diverse and complex natural products in top-1 predictions. (b) Incorrect top-2 predictions ranked by Tanimoto similarity remain chemically valid and structurally similar to the ground truth.
  • Figure 4: Zero-shot performance on experimental NMR spectra, shown as the mean $\pm$ standard deviation over three independent sampling runs. Models are trained on USPTO. Evaluation is on $^1$H and ${}^{13}$C spectra for SpecTeach, and on ${}^{13}$C spectra for NMRShiftDB2. ${}^{*}$: reported results.
  • ...and 10 more figures
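The Figure 3 caption refers to ranking predictions by Tanimoto similarity to the ground truth. A minimal sketch of that metric follows, assuming each molecule is represented by the set of on-bit indices of some binary fingerprint. The fingerprint type is not specified in this summary, and the function names and toy bit sets are hypothetical.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


def rank_by_similarity(
    reference: Set[int], candidates: List[Set[int]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, tanimoto(reference, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (made-up bit indices, not real molecules).
ground_truth = {1, 2, 3}
predictions = [{9}, {1, 2, 3}, {2, 3, 4}]
ranking = rank_by_similarity(ground_truth, predictions)
```

A similarity of 1.0 means the two fingerprints are identical, so an incorrect prediction with a score close to 1.0 shares most of its substructure bits with the true molecule.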