DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models

Liang Wang, Yu Rong, Tingyang Xu, Zhenyi Zhong, Zhiyuan Liu, Pengju Wang, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang, Yang Zhang

TL;DR

DiffSpectra reframes molecular structure elucidation from spectra as a conditional generation task, using diffusion models that jointly produce 2D topology and 3D geometry. It introduces the Diffusion Molecule Transformer (DMT) for SE(3)-equivariant denoising and SpecFormer for multi-modal spectral conditioning, enabling de novo structure elucidation from UV–Vis, IR, and Raman spectra. The framework reaches about $40.8\%$ top-1 and $99.5\%$ top-10 accuracy with high 3D fidelity, and it gains notably from a pre-trained spectral encoder and multi-modal conditioning; sampling ten candidates recovers the correct structure for nearly every test molecule. Gradient-based trajectory analysis reveals a staged generation process, and the approach generalizes across molecules of varying size. These results suggest practical utility for open-ended discovery and downstream validation, and the authors outline future extensions to additional spectroscopies and larger systems.

Abstract

Molecular structure elucidation from spectra is a fundamental challenge in molecular science. Conventional approaches rely heavily on expert interpretation and lack scalability, while retrieval-based machine learning approaches remain constrained by limited reference libraries. Generative models offer a promising alternative, yet most adopt autoregressive architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that formulates molecular structure elucidation as a conditional generation process, directly inferring 2D and 3D molecular structures from multi-modal spectra using diffusion models. Its denoising network is parameterized by the Diffusion Molecule Transformer, an SE(3)-equivariant architecture for geometric modeling, conditioned by SpecFormer, a Transformer-based spectral encoder capturing multi-modal spectral dependencies. Extensive experiments demonstrate that DiffSpectra accurately elucidates molecular structures, achieving 40.76% top-1 and 99.49% top-10 accuracy. Its performance benefits substantially from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. To our knowledge, DiffSpectra is the first framework that unifies multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.
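
The top-1 and top-10 figures quoted above are top-$K$ accuracies: a test molecule counts as solved if the correct structure appears among the first $K$ sampled candidates. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes each structure is represented by a canonical string identifier (for example a canonical SMILES string), and the function name `accuracy_at_k` and the toy data are hypothetical.

```python
from typing import Sequence


def accuracy_at_k(
    candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of targets that appear among the first k candidates.

    Assumption: structures are compared as canonical string identifiers,
    so string equality stands in for exact structure match.
    """
    if len(candidates) != len(targets):
        raise ValueError("candidates and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(target in ranked[:k] for ranked, target in zip(candidates, targets))
    return hits / len(targets)


# Toy data (hypothetical): two ranked candidates for each of three spectra.
candidates = [["CCO", "COC"], ["CC=O", "C=CO"], ["CCN", "CNC"]]
targets = ["CCO", "C=CO", "CCC"]

top1 = accuracy_at_k(candidates, targets, k=1)  # only the first molecule is hit
top2 = accuracy_at_k(candidates, targets, k=2)  # the second molecule is hit as well
```

Because the first $K$ candidates always include the first $K-1$, this quantity cannot decrease as $K$ grows, which is why drawing more samples can only help under this metric.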

Paper Structure

This paper contains 46 sections, 48 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview and architecture of the DiffSpectra framework. a, Schematic of the diffusion-based generative framework underlying DiffSpectra, comprising a continuous-time forward diffusion process and a reverse-time denoising process. The denoising network is implemented as the Diffusion Molecule Transformer (DMT), while spectral features encoded by SpecFormer provide conditional guidance along the denoising trajectory. b, Architecture of DMT, which jointly processes and denoises node features, edge features, and atomic coordinates through three parallel streams within a stack of equivariant blocks. The streams are interconnected by a relational multi-head attention mechanism and equivariant geometric updates, and are conditioned on shared spectral and time information. c, Architecture and pre-training strategy of SpecFormer, a unified Transformer encoder for multi-modal spectroscopic data (UV–Vis, IR, and Raman). SpecFormer is pre-trained with masked-patch reconstruction (MPR) and contrastive learning against corresponding 3D molecular structures.
  • Figure 2: Quantitative comparison of DiffSpectra and its variants on molecular structure elucidation tasks. a, Structure elucidation performance of DiffSpectra under different configurations. We compared a pre-trained SpecFormer against a SpecFormer without pre-training as the spectral condition encoder, and multi-modal spectra against individual spectra (IR, Raman, UV–Vis). Reported metrics include exact top-1 accuracy, MCES, and several similarity-based metrics. b, Evaluation of elucidated 3D molecular structures. For each generated 3D structure, we performed atom-to-atom mapping against the corresponding ground-truth 3D structure and then computed the RMSD. With a pre-trained SpecFormer and multi-modal spectra as input, DiffSpectra achieves the best elucidation performance, showing the highest atom mapping accuracy and the lowest RMSD. c, Accuracy@$K$ with an increasing number of sampled candidates. We report top-$K$ accuracy as the number of generated candidates $K$ increases. Across all settings, Accuracy@$K$ consistently improves with larger $K$, confirming that sampling multiple candidates significantly increases the likelihood of recovering the correct molecular structure. d, Distribution of similarity-based metrics on the test set, illustrated using violin and box plots. The red line in each box plot denotes the median. e and f, Occurrence of functional groups in the test molecules and the corresponding elucidation performance of DiffSpectra. To account for the imbalance in functional group occurrence, we computed the ROC-AUC score for each group based on whether DiffSpectra correctly identified its presence or absence in each molecule.
  • Figure 3: Visualization of molecular structure elucidation results using DiffSpectra under different configurations. We compared single-spectrum inputs (IR, Raman, UV–Vis), multi-modal spectra, and the effect of the pre-trained SpecFormer. Ground-truth structures are shown on the left for reference.
  • Figure 4: Molecular structure elucidation performance across molecules with varying numbers of atoms. The figure presents results obtained using different similarity metrics, including $\operatorname{TaniSim}_{\mathrm{MG}}$, $\operatorname{CosSim}_{\mathrm{MG}}$, $\operatorname{TaniSim}_{\mathrm{MA}}$, FraggleSim, and FGSim.
  • Figure 5: Gradient-based elucidation trajectory analysis. Gradient heatmaps depict atomic gradients with respect to spectral patches, with each heatmap averaged over a time interval of 0.05 (corresponding to 50 denoising steps). Upon completion of the denoising process, a molecular structure is elucidated that matches the ground-truth structure. Atom indices are annotated on the structure.
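
The Figure 5 caption equates a time interval of 0.05 with 50 denoising steps, so each heatmap is an average over 50 consecutive per-step gradient maps. The sketch below shows one way such window averaging could be done with NumPy; it is an illustration under stated assumptions, not the authors' analysis code. The array layout `(num_steps, num_atoms, num_patches)`, the function name `average_heatmaps`, the total of 1000 steps, and the random stand-in gradients are all hypothetical.

```python
import numpy as np


def average_heatmaps(grads: np.ndarray, steps_per_window: int = 50) -> np.ndarray:
    """Average per-step gradient maps over consecutive, non-overlapping windows.

    grads: array of shape (num_steps, num_atoms, num_patches), where entry
           [t, i, p] is the gradient magnitude of atom i with respect to
           spectral patch p at denoising step t (assumed layout).
    Returns an array of shape
           (num_steps // steps_per_window, num_atoms, num_patches).
    """
    num_steps, num_atoms, num_patches = grads.shape
    if num_steps % steps_per_window != 0:
        raise ValueError("num_steps must be a multiple of steps_per_window")
    # Split the step axis into (window index, step within window), then
    # average over the steps inside each window.
    windows = grads.reshape(
        num_steps // steps_per_window, steps_per_window, num_atoms, num_patches
    )
    return windows.mean(axis=1)


# Stand-in data (hypothetical): 1000 steps, 9 atoms, 16 spectral patches.
rng = np.random.default_rng(0)
grads = np.abs(rng.normal(size=(1000, 9, 16)))

heatmaps = average_heatmaps(grads)
print(heatmaps.shape)  # (20, 9, 16)
```

Under these assumptions a 1000-step trajectory yields 20 heatmaps, one per 0.05 interval, each of which can be plotted as an atoms-by-patches grid.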