TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning
Yunrui Li, Hao Xu, Ambrish Kumar, Duosheng Wang, Christian Heiss, Parastoo Azadi, Pengyu Hong
TL;DR
TransPeakNet tackles the challenge of predicting HSQC cross-peaks in 2D NMR by leveraging a solvent-aware graph neural network and a two-stage transfer strategy. It pre-trains on a large 1D NMR corpus with multi-task learning and then refines its predictions on unlabeled HSQC data through iterative unsupervised fine-tuning and pseudo-annotation, explicitly accounting for solvent effects. The approach achieves state-of-the-art MAEs of $2.05$ ppm for $^{13}$C and $0.165$ ppm for $^{1}$H, with $95.21\%$ concordance to expert peak assignments on a 479-molecule test set, and outperforms traditional tools, especially for large or saccharide-rich molecules. The results demonstrate robust generalization across molecular weights and chemical classes, and the framework lays the groundwork for extending to other 2D NMR modalities via 3D-GNNs and for broader solvent-aware prediction.
Abstract
Nuclear Magnetic Resonance (NMR) spectroscopy is essential for revealing molecular structure, electronic environment, and dynamics. Accurate NMR shift prediction allows researchers to validate structures by comparing predicted and observed shifts. While Machine Learning (ML) has improved one-dimensional (1D) NMR shift prediction, predicting 2D NMR remains challenging due to limited annotated data. To address this, we introduce an unsupervised training framework for predicting cross-peaks in 2D NMR, specifically Heteronuclear Single Quantum Coherence (HSQC). Our approach pretrains an ML model on an annotated 1D dataset of $^{1}$H and $^{13}$C shifts, then finetunes it in an unsupervised manner using unlabeled HSQC data, which simultaneously generates cross-peak annotations. Our model also adjusts for solvent effects. Evaluation on 479 expert-annotated HSQC spectra demonstrates our model's superiority over traditional methods (ChemDraw and MestReNova), achieving Mean Absolute Errors (MAEs) of 2.05 ppm and 0.165 ppm for $^{13}$C and $^{1}$H shifts, respectively. Our algorithmic annotations show a 95.21% concordance with experts' assignments, underscoring the approach's potential for structural elucidation in fields like organic chemistry, pharmaceuticals, and natural products.
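To make the pseudo-annotation and MAE evaluation described above more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each predicted C-H cross-peak is matched one-to-one to an observed HSQC peak by minimizing a scaled distance, and that the MAE is then computed per nucleus over the matched pairs. The names `peak_cost`, `pseudo_annotate`, and `mae`, the scaling constants, and the example shifts are all hypothetical, and the brute-force search is only practical for a handful of peaks.

```python
from itertools import permutations

# Assumed scaling factors that put 13C and 1H errors on a comparable footing.
# These values are illustrative, not taken from the paper.
C_SCALE = 10.0
H_SCALE = 1.0


def peak_cost(pred, obs):
    """Scaled L1 distance between two (13C ppm, 1H ppm) cross-peaks."""
    return abs(pred[0] - obs[0]) / C_SCALE + abs(pred[1] - obs[1]) / H_SCALE


def pseudo_annotate(predicted, observed):
    """Return a tuple where entry i is the index of the observed peak
    assigned to predicted peak i, minimizing the total scaled distance.

    Brute force over all permutations; assumes equal numbers of peaks.
    """
    if len(predicted) != len(observed):
        raise ValueError("this sketch assumes a one-to-one peak matching")
    return min(
        permutations(range(len(observed))),
        key=lambda perm: sum(
            peak_cost(p, observed[j]) for p, j in zip(predicted, perm)
        ),
    )


def mae(predicted, observed, assignment, axis):
    """Mean absolute error along one axis (0 = 13C, 1 = 1H) over matched peaks."""
    errors = [abs(p[axis] - observed[j][axis]) for p, j in zip(predicted, assignment)]
    return sum(errors) / len(errors)


# Made-up shifts for a molecule with three C-H pairs.
predicted = [(20.0, 1.0), (70.0, 3.5), (128.0, 7.2)]
observed = [(127.0, 7.3), (21.0, 1.2), (72.0, 3.4)]

assignment = pseudo_annotate(predicted, observed)
mae_c = mae(predicted, observed, assignment, axis=0)
mae_h = mae(predicted, observed, assignment, axis=1)
print(assignment, round(mae_c, 3), round(mae_h, 3))
```

In an iterative scheme of the kind the abstract outlines, the matched observed peaks would serve as pseudo-labels for the next round of fine-tuning. Real spectra may also contain fewer peaks than C-H pairs (for example, because of equivalent atoms), which this sketch does not handle.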
