Synergistic cross-modal learning for experimental NMR-based structure elucidation

Fanjie Xu, Jinyuan Hu, Jingxiang Zou, Junjie Wang, Boying Huang, Zhifeng Gao, Xiaohong Ji, Weinan E, Zhong-Qun Tian, Fujie Tang, Jun Cheng

TL;DR

NMRPeak presents a unified cross-modal framework that jointly tackles spectrum prediction, cross-modal retrieval, and de novo structure generation for one-dimensional NMR data. By introducing a chemically-aware adaptive tokenizer and a peak-aware assignment-free similarity metric, the approach aligns spectral representations with real experimental workflows and enables rigorous cross-task interaction. The authors construct a large, experimentally grounded benchmark (~1.8 million spectra) to quantify the simulation-to-experiment gap and demonstrate that integrated training yields state-of-the-art performance: spectrum prediction closes much of the gap, molecular retrieval exceeds 95% top-1, and stereochemistry-aware generation reaches about 75% top-1 accuracy on experimental data. This synergy supports automated, high-throughput molecular structure elucidation and establishes a new paradigm for cross-modal learning in analytical chemistry.

Abstract

One-dimensional nuclear magnetic resonance (NMR) spectroscopy is essential for molecular structure elucidation in organic synthesis, drug discovery, natural product characterization, and metabolomics, yet its interpretation remains heavily dependent on expert knowledge and difficult to scale. Although machine learning has been applied to NMR spectrum prediction, library retrieval, and structure generation, these tasks have evolved in isolation using simulated data and incompatible spectral representations, limiting their utility under real experimental scenarios. Here we present NMRPeak, a unified cross-modal learning system that integrates these three tasks through experimentally grounded design. We curate approximately 1.8 million experimental and simulated spectra to construct the largest benchmark for NMR-based structure elucidation and systematically quantify the distribution shift between these domains. We introduce a chemically-aware adaptive tokenizer that dynamically balances discretization granularity to preserve spectral semantics while controlling vocabulary size, and an assignment-free peak-aware similarity metric that enables direct comparison between predicted and experimental spectra. Through a unified molecule-to-spectrum paradigm and synergistic coupling of prediction, retrieval, and generation modules, NMRPeak achieves transformative performance on experimental benchmarks: it overcomes the longstanding simulation-to-experiment gap in spectrum prediction while delivering over 95% top-1 accuracy in molecular retrieval and approximately 75% top-1 accuracy in stereochemistry-aware de novo structure generation. These capabilities establish a foundation for automated, high-throughput molecular structure elucidation in organic synthesis, drug discovery, and chemical biology.
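
The idea of balancing discretization granularity against vocabulary size can be illustrated with a small sketch. The abstract does not specify how the adaptive tokenizer chooses its bins, so the quantile-based binning below is an assumption made purely for illustration, and the names `fit_adaptive_bins` and `tokenize_shift` are hypothetical. The sketch places bin edges at empirical quantiles, so densely populated chemical-shift regions receive narrow bins and sparse regions receive wide ones, while the total token count stays fixed.

```python
import numpy as np


def fit_adaptive_bins(shifts, n_bins):
    """Place bin edges at empirical quantiles of the observed shifts.

    Assumption for illustration only: dense regions of the shift
    distribution get narrower bins; the paper's actual rule may differ.
    """
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(np.asarray(shifts, dtype=float), qs)
    return np.unique(edges)  # drop duplicate edges caused by repeated values


def tokenize_shift(shift, edges, prefix="C"):
    """Map one chemical shift (ppm) to a token such as 'C_2'."""
    # Searching only the interior edges keeps the index in [0, n_bins - 1],
    # so out-of-range shifts fall into the first or last bin.
    idx = int(np.searchsorted(edges[1:-1], shift, side="right"))
    return f"{prefix}_{idx}"


# Toy data: eight closely spaced shifts and one outlier.
edges = fit_adaptive_bins([0, 1, 2, 3, 4, 5, 6, 7, 100], n_bins=4)
tokens = [tokenize_shift(s, edges) for s in (1.0, 5.0, 50.0)]
```

With the toy data, three of the four bins are 2 ppm wide and the last one spans the remaining range up to the outlier, which is the trade-off the abstract describes: resolution where the data are, and a bounded vocabulary overall.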

Paper Structure (24 sections, 23 equations, 9 figures)

Figures (9)

  • Figure 1: The NMRPeak framework. a, Overall architecture of NMRPeak, integrating three synergistic modules: NMRPeak-P for forward spectral simulation, NMRPeak-R for cross-modal retrieval, and NMRPeak-G for inverse structural inference. b, The chemically-aware adaptive tokenizer. This component encodes $^{13}$C and $^{1}$H NMR peaks together with molecular formulas into a unified token space that includes special, categorical, and numerical tokens. The adaptive discretization strategy (right) optimizes the trade-off between vocabulary size and semantic resolution by dynamically adjusting token density based on prior knowledge of the data distribution. c, The peak-aware similarity metric for assignment-free spectral comparison. The algorithm employs a two-round matching process with count-based penalties to compute a final similarity score.
  • Figure 2: Data benchmarking and performance evaluation of NMRPeak on experimental datasets. a, UMAP [mcinnes2018umap] projection of structural distributions between simulated (MST-NMR [alberts2024unraveling]) and experimental (NMRexp [wang2025nmrexp]) datasets. b, Statistical summary of the curated NMR benchmark, comprising over 1.8 million structure–spectrum pairs across training, validation, and test splits for both simulated and experimental domains. c, Comparative analysis of top-1 generation accuracy (CHF-to-Mol) under different training-to-test scenarios for the baseline MST model [alberts2024unraveling], NMRPeak-G (single module), and the unified NMRPeak framework. d, e, Representative case studies of complex molecular structure elucidation. Each panel displays the input molecular information (SMILES, 2D/3D structures, molecular formula, and text-based NMR peaks), alongside the forward spectrum simulation results and inverse structure inference results.
  • Figure 3: Performance of NMRPeak-P. a, Distribution of spectral similarity scores between predicted and experimental $^{13}$C (left) and $^{1}$H (right) NMR spectra, calculated using the peak-aware similarity metric. b, Correlation between predicted and experimental chemical shifts ($\delta$) for $^{13}$C (top) and $^{1}$H (bottom), derived from the first-round valid matching using the peak-aware similarity metric. c, Error distribution of predicted peak counts for $^{13}$C and $^{1}$H NMR spectra. For $^{1}$H spectra, peak counts are expanded based on integration values to ensure physically grounded comparisons. d, Radar charts show the top-$k$ ($k=1, 5, 10$) molecular generation accuracy of NMRPeak-G using simulated and experimental spectra across six input configurations. For all panels, C and H correspond to $^{13}$C and $^{1}$H NMR peaks, and F denotes molecular formula constraints.
  • Figure 4: Performance of NMRPeak-R. a, b, Impact of database scale on retrieval accuracy for molecule-to-spectrum (a) and spectrum-to-molecule (b) tasks using a single contrastive learning strategy. Results are reported for database sizes ranging from 100 entries to the full experimental set ($\approx$100k entries). c, Comprehensive retrieval performance on the full-scale benchmark across diverse input modalities. The left panel shows contrastive learning-based molecule-to-spectrum retrieval, and the right panel displays spectrum-to-molecule retrieval results using a multi-dimensional fusion strategy across six input configurations. For all panels, C and H correspond to $^{13}$C and $^{1}$H NMR peaks, and F denotes molecular formula constraints.
  • Figure 5: Ablation study and weighting analysis of NMRPeak-R. a, Comparison of retrieval accuracies for different variants across various input modalities. b, Ternary plots illustrating the impact of weighting strategies on top-1 retrieval accuracy for diverse configurations. For all panels, SME, SSE, and SSR denote spectrum-to-molecule embedding, spectrum-to-spectrum embedding, and spectrum-to-spectrum rule-based similarity, respectively. C and H correspond to $^{13}$C and $^{1}$H NMR peaks, and F denotes molecular formula constraints.
  • ...and 4 more figures
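
The Figure 1c caption describes an assignment-free comparison built from two rounds of matching plus count-based penalties, without giving the formula. The following is only a toy sketch of that general shape, not the authors' algorithm: the function name `peak_similarity`, the tolerances `tol` and `loose_tol`, the half weight in the second round, and the normalisation by the larger peak count are all assumptions chosen for illustration.

```python
def peak_similarity(pred, expt, tol=2.0, loose_tol=5.0):
    """Toy assignment-free similarity in [0, 1] between two peak lists (ppm).

    All weights and tolerances are illustrative assumptions.
    """
    if not pred or not expt:
        return 0.0

    # Round 1: greedy one-to-one matching, closest pairs first, within `tol`.
    pairs = sorted(
        (abs(p - e), i, j)
        for i, p in enumerate(pred)
        for j, e in enumerate(expt)
    )
    used_pred, used_expt, score = set(), set(), 0.0
    for dist, i, j in pairs:
        if dist > tol:
            break  # pairs are sorted, so no closer candidates remain
        if i in used_pred or j in used_expt:
            continue
        used_pred.add(i)
        used_expt.add(j)
        score += 1.0 - dist / tol  # closer pairs contribute more

    # Round 2: leftover predicted peaks may reuse any experimental peak
    # within the looser tolerance, at half weight (e.g. overlapping signals).
    for i, p in enumerate(pred):
        if i in used_pred:
            continue
        dist = min(abs(p - e) for e in expt)
        if dist <= loose_tol:
            score += 0.5 * (1.0 - dist / loose_tol)

    # Count-based penalty: dividing by the larger peak count means that
    # missing or surplus peaks lower the final score.
    return score / max(len(pred), len(expt))


identical = peak_similarity([10.0, 50.0], [10.0, 50.0])
missing_peak = peak_similarity([10.0], [10.0, 50.0])
```

In this sketch, identical peak lists score 1.0, and a prediction that misses one of two experimental peaks scores 0.5 even though its single peak matches exactly, which is the kind of behaviour a count-based penalty is meant to produce. No atom-to-peak assignment is needed at any point, since only peak positions are compared.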