TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning

Yunrui Li, Hao Xu, Ambrish Kumar, Duosheng Wang, Christian Heiss, Parastoo Azadi, Pengyu Hong

TL;DR

TransPeakNet tackles the challenge of predicting HSQC cross-peaks in 2D NMR by leveraging a solvent-aware graph neural network and a two-stage transfer strategy. It pre-trains on a large 1D NMR corpus with multi-task learning and then refines predictions on unlabeled HSQC data through iterative unsupervised fine-tuning and pseudo-annotation, explicitly accounting for solvent effects. The approach achieves state-of-the-art MAEs of $2.05$ ppm for $^{13}$C and $0.165$ ppm for $^{1}$H, with $95.21\%$ concordance to expert peak assignments on a 479-molecule test set, and outperforms traditional tools, especially for large or saccharide-rich molecules. The results demonstrate robust generalization across molecular weights and chemical classes, and the framework lays groundwork for extending to other 2D NMR modalities via 3D-GNNs and broader solvent-aware predictions.

Abstract

Nuclear Magnetic Resonance (NMR) spectroscopy is essential for revealing molecular structure, electronic environment, and dynamics. Accurate NMR shift prediction allows researchers to validate structures by comparing predicted and observed shifts. While Machine Learning (ML) has improved one-dimensional (1D) NMR shift prediction, predicting 2D NMR remains challenging due to limited annotated data. To address this, we introduce an unsupervised training framework for predicting cross-peaks in 2D NMR, specifically Heteronuclear Single Quantum Coherence (HSQC). Our approach pre-trains an ML model on an annotated 1D dataset of $^{1}$H and $^{13}$C shifts, then fine-tunes it in an unsupervised manner using unlabeled HSQC data, which simultaneously generates cross-peak annotations. Our model also adjusts for solvent effects. Evaluation on 479 expert-annotated HSQC spectra demonstrates our model's superiority over traditional methods (ChemDraw and Mestrenova), achieving Mean Absolute Errors (MAEs) of 2.05 ppm for $^{13}$C shifts and 0.165 ppm for $^{1}$H shifts. Our algorithmic annotations show a 95.21% concordance with experts' assignments, underscoring the approach's potential for structural elucidation in fields like organic chemistry, pharmaceuticals, and natural products.
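The pseudo-annotation idea in the abstract (pairing predicted shifts with unlabeled HSQC cross-peaks, then scoring the result by MAE) can be illustrated with a minimal sketch. The text above does not specify the matching procedure, so the brute-force one-to-one matching below is a stand-in rather than the authors' method. The function names, the `c_scale` weighting between the two axes, and the toy shift values are all assumptions made for illustration.

```python
from itertools import permutations


def match_peaks(predicted, observed, c_scale=10.0):
    """Assign each predicted (13C, 1H) pair to a distinct observed cross-peak.

    Brute-force minimum-cost matching, practical only for a handful of peaks.
    c_scale is a hypothetical factor that puts 13C errors (tens of ppm range)
    on a scale comparable to 1H errors. Assumes len(predicted) <= len(observed).
    """
    def cost(p, o):
        return abs(p[0] - o[0]) / c_scale + abs(p[1] - o[1])

    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(observed)), len(predicted)):
        total = sum(cost(predicted[i], observed[j]) for i, j in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm)


def shift_mae(predicted, observed, assignment):
    """Return (13C MAE, 1H MAE) in ppm over the matched pairs."""
    n = len(predicted)
    c_err = sum(abs(predicted[i][0] - observed[j][0])
                for i, j in enumerate(assignment)) / n
    h_err = sum(abs(predicted[i][1] - observed[j][1])
                for i, j in enumerate(assignment)) / n
    return c_err, h_err


# Toy values (ppm), invented for illustration only.
predicted = [(20.0, 1.0), (60.0, 3.5), (128.0, 7.2)]
observed = [(127.0, 7.3), (21.5, 1.1), (61.0, 3.4)]

assignment = match_peaks(predicted, observed)  # index of observed peak per prediction
c_mae, h_mae = shift_mae(predicted, observed, assignment)
```

In an iterative scheme of the kind the abstract describes, the matched observed peaks would serve as pseudo-labels for the next round of fine-tuning. A real pipeline would also need to handle symmetric C–H groups that share one peak, which this sketch ignores.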

Paper Structure (17 sections, 1 equation, 7 figures)

Figures (7)

  • Figure 1: Illustration of the TransPeakNet model design (A) and training strategy (B and C). (A) The model takes a molecular structure and derives its atomic representations from a GNN. The solvent information is encoded into a latent representation by the solvent encoder. Each atom's representation is concatenated with the solvent representation, and the result is used to predict the carbon and proton shifts of the cross-peaks. (B) Model pre-training on the annotated 1D NMR dataset using multi-task training (MTT). (C) The pre-trained model is refined through an unsupervised process using the unlabeled HSQC dataset. The final output of the model includes both the HSQC cross-peaks and the atom alignment.
  • Figure 2: (A) MAEs of C–H shift prediction on the test dataset. (B) Peak assignment accuracy, obtained by comparing algorithm-generated annotations with expert annotations. Out of the 479 molecules in the test set, 456 molecules have all peaks annotated correctly. For the remaining 23 molecules, 81.56% of the peaks agree with expert annotations. (C) An example of using our model to accurately predict cross-peaks and align them with experimental signals. The molecule is shown at the top left, where each C–H bond is labeled with a numerical identifier. Notably, the symmetric pairs of bonds (labeled "2", "3", and "4") are each expected to generate a single HSQC cross-peak due to their structural equivalence. The HSQC cross-peaks predicted by our model (in orange) and their alignments to the experimental observations (in blue) are plotted on the right. The alignments are indicated by the dashed circles.
  • Figure 3: (A) The distribution of the 9 solvent classes in the training dataset. (B) Solvent effect on proton shift prediction. The model provides the most accurate shift prediction when given the correct solvent information. In most cases, specifying the solvent as "unknown" yields better performance than using a wrong solvent as input. The acid solvent environment is marked as "N/A" in the table because it is too rare in the dataset to appear in the test dataset.
  • Figure 4: (A) Performance comparison between our proposed model and established traditional tools on randomly sampled molecules from the test dataset. Our model performs better across all molecular weight categories, and its advantage becomes increasingly evident as molecular size increases. The overall result weights the molecular weight categories equally. (B) Comparison of our model, ChemDraw, and Mestrenova on two typical examples: a small molecule (a) with a weight of $\sim$250 Dalton and a larger molecule (b) with a weight of $\sim$500 Dalton. The observed experimental signals and the predicted signals are colored in blue and orange, respectively. The prediction errors (MAEs) are shown in the bottom right corner of each plot. Our model performs better than ChemDraw and Mestrenova, and particularly excels in handling large molecules with complex conformations.
  • Figure 5: Model performance comparison across different categories of molecules, segmented by (A) molecular weight and (B) saccharide content.
  • ...and 2 more figures
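
Two of the summary statistics mentioned in the captions can be made concrete with a short sketch: the equally weighted overall MAE across molecular weight categories (Figure 4A) and the split between fully correct molecules and peak-level agreement in the rest (Figure 2B). The helper names, the weight bins, and all numbers below are hypothetical. The captions also do not say whether the agreement among the remaining molecules is pooled over peaks or averaged per molecule, so pooling is an assumption here.

```python
def equal_weight_mae(errors_by_category):
    """Average per-category MAEs so that every category counts equally,
    regardless of how many molecules fall into it."""
    per_category = {
        name: sum(errs) / len(errs)
        for name, errs in errors_by_category.items()
        if errs  # skip empty categories
    }
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


def annotation_agreement(per_molecule):
    """Summarise peak-assignment agreement with expert annotations.

    per_molecule: list of (n_matching_peaks, n_peaks), one entry per molecule.
    Returns (fraction of molecules with every peak matching,
             pooled peak agreement among the remaining molecules or None).
    """
    rest = [(m, n) for m, n in per_molecule if m != n]
    fully_correct = (len(per_molecule) - len(rest)) / len(per_molecule)
    if not rest:
        return fully_correct, None
    pooled = sum(m for m, _ in rest) / sum(n for _, n in rest)
    return fully_correct, pooled


# Made-up absolute errors (ppm) grouped into hypothetical weight bins.
toy_errors = {
    "<300 Da": [0.1, 0.3],
    "300-500 Da": [0.2],
    ">500 Da": [0.4, 0.6, 0.8],
}
per_category, overall = equal_weight_mae(toy_errors)

# Made-up (matching peaks, total peaks) counts for four molecules.
toy_counts = [(5, 5), (8, 8), (3, 4), (6, 10)]
fully_correct, pooled = annotation_agreement(toy_counts)
```

With equal weighting, a sparsely populated category influences the overall figure as much as a crowded one, so the result generally differs from a plain mean over all molecules.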