2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision

Yunrui Li, Hao Xu, Pengyu Hong

TL;DR

2DNMRGym addresses the scarcity of large-scale annotated experimental HSQC data for atom-level NMR analysis. The dataset contains 22,157 HSQC spectra curated from HMDB and CH-NMR-NP, each linked to a SMILES string and a molecular graph augmented with RDKit-derived features. It uses a two-tier annotation scheme: silver-standard algorithmic labels for training and gold-standard expert labels for evaluation. Under this surrogate supervision framework, each cross peak maps to a specific C-H bond in the molecular graph. Benchmarks across 2D and 3D GNNs and a GNN-Transformer show that 2D GIN-based models perform best for atom-level HSQC shift prediction and that transformer components generally help, while 3D models give mixed results because of conformer issues. The dataset and code are open-source, which enables scalable atom-level NMR learning and sets the stage for extensions to other NMR modalities such as HMBC and COSY for broader structural characterization.

Abstract

Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. Progress, however, has been limited by the lack of large-scale, high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model's ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code are open-source and available on Hugging Face and GitHub.
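
The evaluation side of the surrogate supervision setup can be illustrated with a short sketch. This is not the authors' code: the `CrossPeak` schema, the field names, the toy shift values, and the use of mean absolute error as the metric are all assumptions made for the example. It only shows the idea that predictions from a model trained on algorithmic (silver) labels are scored against expert (gold) labels, matched per carbon index.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CrossPeak:
    """One HSQC cross peak assigned to a carbon atom (hypothetical schema)."""
    carbon_index: int  # index of the carbon in the molecular graph
    c_shift: float     # 13C chemical shift in ppm
    h_shift: float     # 1H chemical shift in ppm


def mean_absolute_errors(
    predicted: Dict[int, Tuple[float, float]],
    reference: List[CrossPeak],
) -> Tuple[float, float]:
    """Return (13C MAE, 1H MAE) over carbons present in both inputs."""
    c_errors: List[float] = []
    h_errors: List[float] = []
    for peak in reference:
        if peak.carbon_index not in predicted:
            continue  # skip carbons the model did not predict
        pred_c, pred_h = predicted[peak.carbon_index]
        c_errors.append(abs(pred_c - peak.c_shift))
        h_errors.append(abs(pred_h - peak.h_shift))
    if not c_errors:
        raise ValueError("no overlapping carbon indices to evaluate")
    return sum(c_errors) / len(c_errors), sum(h_errors) / len(h_errors)


# Toy values for a single test molecule, invented for illustration only.
# Gold labels: expert-annotated cross peaks used for evaluation.
gold_labels = [CrossPeak(0, 18.5, 1.25), CrossPeak(1, 57.0, 3.60)]

# Stand-in for the output of a model trained on silver labels:
# carbon index -> (predicted 13C shift, predicted 1H shift).
predictions = {0: (18.0, 1.20), 1: (58.0, 3.70)}

c_mae, h_mae = mean_absolute_errors(predictions, gold_labels)
print(f"13C MAE: {c_mae:.3f} ppm, 1H MAE: {h_mae:.3f} ppm")
```

Matching on carbon index rather than on peak order keeps the comparison tied to specific atoms in the graph, which is what makes the evaluation atom-level.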

Paper Structure

This paper contains 35 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The 2DNMRGym dataset comprises multi-modal components, including the SMILES representation of each molecule and its conversion to a molecular graph. This graph includes both 2D topological structures and Cartesian coordinates for 3D spatial information. The ground truth spectrum is represented as cross peak tables, where the "Carbon Index" maps to the corresponding carbons in the molecular topology graph.
  • Figure 2: Data statistics by number of atoms, molecular weight, and Tanimoto similarity.
  • Figure 3: Scaffold analysis for the training and test datasets.
  • Figure 4: A demonstration workflow using the 2DNMRGym dataset to train GNN models. The graph representations learned by these benchmark models can be evaluated on the downstream HSQC cross peak prediction task.
  • Figure 5: An annotation example. To avoid overcrowding, only a few "C-H bond to peak" associations are shown. For a large molecule with a complex structure like this one, aligning the chemical bonds with the cross peaks is extremely difficult due to signal overlap and degeneracy. The bottom-right corner of the HSQC spectrum shows a 3D abstract skeleton of the molecule.