
NMRTrans: Structure Elucidation from Experimental NMR Spectra via Set Transformers

Liujia Yang, Zhuo Yang, Jiaqing Xie, Yubin Wang, Ben Gao, Tianfan Fu, Xingjian Wei, Jiaxing Sun, Jiang Wu, Conghui He, Yuqiang Li, Qinying Gu

TL;DR

This work tackles automated molecular structure elucidation from experimental NMR spectra by introducing NMRSpec, a large-scale corpus of experimental $^1$H and $^{13}$C spectra, and NMRTrans, a set-based Transformer that respects the unordered, permutation-invariant nature of NMR peak data. By encoding spectra as peak sets with spectrum-aware features and processing them through Induced Set Attention Blocks, NMRTrans achieves state-of-the-art performance on experimental benchmarks and generalizes well to out-of-distribution data. The approach combines a permutation-invariant encoder with a cross-attentive, order-agnostic decoder to produce canonical SMILES, supported by a carefully curated dataset and extensive ablations that highlight the value of experimental data and of an architecture aligned with NMR physics. Overall, NMRSpec and NMRTrans advance scalable, reliable NMR-based structure elucidation, with practical implications for high-throughput chemistry and automated spectral analysis.

Abstract

Nuclear Magnetic Resonance (NMR) spectroscopy is fundamental for molecular structure elucidation, yet interpreting spectra at scale remains time-consuming and highly expertise-dependent. While recent spectrum-as-language modeling and retrieval-based methods have shown promise, they rely heavily on large corpora of computed spectra and exhibit notable performance drops when applied to experimental measurements. To address these issues, we build NMRSpec, a large-scale corpus of experimental $^1$H and $^{13}$C spectra mined from chemical literature, and propose NMRTrans, which models spectra as unordered peak sets and aligns the model's inductive bias with the physical nature of NMR. To the best of our knowledge, NMRTrans is the first NMR Transformer trained solely on large-scale experimental spectra. It achieves state-of-the-art performance on experimental benchmarks, improving Top-10 Accuracy over the strongest baseline by 17.82 points (61.15% vs. 43.33%), which underscores the importance of experimental data and structure-aware architectures for reliable NMR structure elucidation.
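
To make the "unordered peak set" idea concrete, here is a minimal sketch contrasting an order-sensitive serialization of a spectrum with an order-free multiset view. The `Peak` fields, the helper names, and the toy peak values are all assumptions made for illustration; they are not the NMRSpec schema, the NMRTrans feature set, or data from the corpus.

```python
from collections import Counter
from typing import NamedTuple


class Peak(NamedTuple):
    # Hypothetical per-peak features, chosen only for this illustration.
    shift_ppm: float
    multiplicity: str
    n_protons: int


def as_sequence(peaks):
    """Order-sensitive view: what a token-sequence model would read."""
    return " | ".join(
        f"{p.shift_ppm:.2f} {p.multiplicity} {p.n_protons}H" for p in peaks
    )


def as_peak_set(peaks):
    """Order-free view: a multiset ignores listing order but keeps duplicates."""
    return Counter(peaks)


# Toy values for illustration only (not taken from NMRSpec).
toy_spectrum = [Peak(3.70, "q", 2), Peak(1.20, "t", 3), Peak(2.60, "s", 1)]
reordered = list(reversed(toy_spectrum))

same_set = as_peak_set(toy_spectrum) == as_peak_set(reordered)
same_sequence = as_sequence(toy_spectrum) == as_sequence(reordered)
print(same_set, same_sequence)
```

The two listings describe the same measurement, so a model whose input is the multiset sees identical data, whereas a sequence model sees two different strings and has to learn that they are equivalent.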

Paper Structure (41 sections, 5 theorems, 19 equations, 12 figures, 7 tables)

Key Result

lemma 1

Let $\mathbf{Q} \in \mathbb{R}^{n \times d}$, $\mathbf{K} \in \mathbb{R}^{m \times d}$, and $\mathbf{V} \in \mathbb{R}^{m \times d}$. The MAB satisfies:
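
The lemma statement is cut off in this excerpt. As a hedged illustration of what "permutation properties" usually means for an attention block, the sketch below numerically checks two standard facts for plain single-head scaled dot-product attention: permuting the query rows permutes the output rows in the same way, and jointly permuting the key and value rows leaves the output unchanged. This is a simplification under stated assumptions: it omits the multi-head projections, residual connections, layer normalization, and feed-forward layer of a full Multihead Attention Block (MAB), and it is not claimed to be the paper's exact statement. All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return out


def permute_rows(rows, order):
    return [rows[i] for i in order]


def allclose(A, B, tol=1e-9):
    return len(A) == len(B) and all(
        abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb)
    )


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, d = 4, 6, 3  # Q is n x d; K and V are m x d, as in the lemma's setup
Q = random_matrix(rng, n, d)
K = random_matrix(rng, m, d)
V = random_matrix(rng, m, d)

perm_q = [2, 0, 3, 1]
perm_kv = [5, 3, 0, 1, 4, 2]
base = attention(Q, K, V)

# Permuting queries permutes outputs the same way (equivariance in Q).
equivariant_in_q = allclose(
    attention(permute_rows(Q, perm_q), K, V), permute_rows(base, perm_q)
)
# Jointly permuting keys and values changes nothing (invariance in K, V).
invariant_in_kv = allclose(
    attention(Q, permute_rows(K, perm_kv), permute_rows(V, perm_kv)), base
)
print(equivariant_in_q, invariant_in_kv)
```

Both properties follow from the fact that each output row is a weighted sum over all key-value pairs, computed independently for each query; the extra row-wise layers in a full MAB do not change this.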

Figures (12)

  • Figure 1: Local chemical environments determine NMR spectral features.
  • Figure 2: Left: Curation of our NMRSpec. Right: Pipeline of NMRTrans: Set Transformer encoders for $^1$H/$^{13}$C peak sets, feature concatenation (optionally with molecular formula), and a T5 decoder for SMILES generation.
  • Figure 3: Detailed performance analysis of NMRTrans with NMRMind. (a) Top-1 and Top-10 Sequence accuracy comparison under varying input modality combinations (e.g., $^{1}$H, $^{13}$C, Formula). (b) Percentage of test samples where the Top-5 predictions meet specific Tanimoto similarity thresholds (x-axis: similarity score $\ge 0.5, 0.7, \dots, 1.0$). (c) Prediction accuracy (Top-1 and Top-5) as a function of molecular complexity (number of heavy atoms).
  • Figure 4: NMRTrans successfully reconstructs challenging motifs including long aliphatic chains, heterocycles, and heavy molecules ($\geq$40 atoms), with ground-truth structures recovered within the Top-3 predictions for ambiguous cases.
  • Figure 5: Impact of architectural inductive bias on training dynamics. The curve compares validation accuracy of NMRTrans with and without Positional Encodings (PE), demonstrating that removing PE accelerates convergence and improves final performance.
  • ...and 7 more figures
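
The Figure 3 caption refers to Top-k sequence accuracy and Tanimoto similarity thresholds. The sketch below shows one common way such metrics are computed. It rests on several assumptions: predictions are ranked lists of SMILES strings that have already been canonicalized, fingerprints are represented as plain sets of on-bit indices, and the threshold metric takes the best similarity among the Top-k candidates. The paper's exact evaluation protocol is not given in this excerpt, and all names and toy data here are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of samples whose target appears among the first k candidates."""
    hits = sum(t in preds[:k] for preds, t in zip(ranked_predictions, targets))
    return hits / len(targets)


def best_similarity(candidate_fps, target_fp, k):
    """Highest Tanimoto similarity to the target among the first k candidates."""
    return max(tanimoto(fp, target_fp) for fp in candidate_fps[:k])


def fraction_at_threshold(best_similarities, threshold):
    """Fraction of samples whose best Top-k similarity meets the threshold."""
    return sum(s >= threshold for s in best_similarities) / len(best_similarities)


# Toy data for illustration only.
targets = ["CCO", "c1ccccc1", "CC(=O)O"]
ranked_predictions = [
    ["CCO", "CCN"],
    ["C1CCCCC1", "c1ccccc1"],
    ["CC(=O)N", "CCO"],
]
top1 = top_k_accuracy(ranked_predictions, targets, k=1)
top2 = top_k_accuracy(ranked_predictions, targets, k=2)

best = best_similarity([{1, 2}, {1, 2, 3, 4, 5}], {1, 2, 3, 4}, k=2)
share = fraction_at_threshold([1.0, 0.8, 0.4], threshold=0.7)
print(top1, top2, best, share)
```

Exact string matching is only meaningful once every SMILES has been canonicalized with the same toolkit, and the similarity values depend on the fingerprint type, so both choices should be fixed before comparing numbers across methods.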

Theorems & Definitions (5)

  • lemma 1: Permutation Properties of MAB
  • proposition 1: ISAB is Equivariant
  • proposition 2: PMA is Invariant
  • lemma 2: Cross-Attention Invariance
  • theorem 1: End-to-End Order Independence
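
The statements listed above are not reproduced in this excerpt. As a reminder of the standard definitions they most likely build on (an assumption, not the paper's own wording), let $\mathbf{X} \in \mathbb{R}^{n \times d}$ hold $n$ peak embeddings as rows and let $\mathbf{P}$ be any $n \times n$ permutation matrix. The symbols $f$ and $g$ are generic placeholders for an equivariant layer and an invariant pooling step.

```latex
% Permutation equivariance: reordering the input rows reorders the output rows.
f(\mathbf{P}\mathbf{X}) = \mathbf{P}\, f(\mathbf{X})

% Permutation invariance: reordering the input rows leaves the output unchanged.
g(\mathbf{P}\mathbf{X}) = g(\mathbf{X})

% Composition: an equivariant map followed by an invariant map is invariant.
(g \circ f)(\mathbf{P}\mathbf{X})
  = g\bigl(\mathbf{P}\, f(\mathbf{X})\bigr)
  = g\bigl(f(\mathbf{X})\bigr)
```

Under these definitions, stacking equivariant blocks and finishing with an invariant pooling or cross-attention step gives an output that does not depend on the order in which peaks are listed, which is the kind of end-to-end claim the theorem title suggests.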