NMRTrans: Structure Elucidation from Experimental NMR Spectra via Set Transformers
Liujia Yang, Zhuo Yang, Jiaqing Xie, Yubin Wang, Ben Gao, Tianfan Fu, Xingjian Wei, Jiaxing Sun, Jiang Wu, Conghui He, Yuqiang Li, Qinying Gu
TL;DR
This work tackles automated molecular structure elucidation from experimental NMR spectra by introducing NMRSpec, a large-scale corpus of real 1H and 13C spectra, and NMRTrans, a set-based Transformer that respects the unordered, permutation-invariant nature of NMR peak data. By encoding spectra as peak sets with spectrum-aware features and processing them through Induced Set Attention Blocks, NMRTrans achieves state-of-the-art performance on experimental benchmarks and generalizes well to out-of-distribution data. The approach combines a permutation-invariant encoder with a cross-attentive, order-agnostic decoder that produces canonical SMILES; it is supported by a carefully curated dataset and by extensive ablations that highlight the value of experimental data and of an architecture aligned with NMR physics. Together, NMRSpec and NMRTrans make NMR-based structure elucidation more scalable and reliable, with practical implications for high-throughput chemistry and automated spectral analysis.
Abstract
Nuclear Magnetic Resonance (NMR) spectroscopy is fundamental to molecular structure elucidation, yet interpreting spectra at scale remains time-consuming and highly expertise-dependent. While recent spectrum-as-language modeling and retrieval-based methods have shown promise, they rely heavily on large corpora of computed spectra and exhibit notable performance drops when applied to experimental measurements. To address these issues, we build NMRSpec, a large-scale corpus of experimental $^1$H and $^{13}$C spectra mined from the chemical literature, and propose NMRTrans, which models spectra as unordered peak sets and thereby aligns the model's inductive bias with the physical nature of NMR. To the best of our knowledge, NMRTrans is the first NMR Transformer trained solely on large-scale experimental spectra. It achieves state-of-the-art performance on experimental benchmarks, improving Top-10 accuracy over the strongest baseline by 17.82 percentage points (61.15% vs. 43.33%). These results underscore the importance of experimental data and structure-aware architectures for reliable NMR structure elucidation.
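To make the "unordered peak set" idea concrete, the following is a minimal sketch of a stripped-down induced set attention block followed by mean pooling, written in plain Python. It is not the NMRTrans implementation: the learned projections, residual connections, normalization, and feed-forward layers of a full block are omitted, and the peak features (chemical shift in ppm, integration, a multiplicity code), the helper names `attend`, `isab`, and `encode`, and the two random inducing points are all assumptions made for illustration. The only point it is meant to show is that reordering the peaks leaves the pooled encoding unchanged.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention; keys double as values (no learned weights)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, keys_values)) for j in range(d)])
    return out


def isab(peaks, inducing):
    """Simplified induced set attention: m inducing points summarize n peaks,
    then every peak reads back from that summary."""
    summary = attend(inducing, peaks)   # m x d, independent of peak order
    return attend(peaks, summary)       # n x d, one row per peak (equivariant)


def encode(peaks, inducing):
    """Mean-pool the per-peak outputs into one order-independent vector."""
    rows = isab(peaks, inducing)
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


# Hypothetical 1H peaks: [shift_ppm, integration, multiplicity_code]
peaks = [[7.26, 1.0, 0.0], [3.71, 3.0, 0.0], [2.10, 2.0, 3.0], [1.25, 3.0, 2.0]]

rng = random.Random(0)  # fixed seed; stands in for learned inducing points
inducing = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]

original = encode(peaks, inducing)
reordered = encode(peaks[::-1], inducing)  # same peaks, reversed order

# Compare with a tolerance, since summation order can change the last float bits.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, reordered))
print(same)
```

In this sketch the inducing points attend over all peaks first, so the cost grows with the number of peaks times the number of inducing points rather than with the square of the number of peaks. A sequence model fed the same peaks as ordered tokens would, in general, give different outputs for the two orderings.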
