DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Montgomery Bohde, Mrunali Manjrekar, Runzhong Wang, Shuiwang Ji, Connor W. Coley

TL;DR

DiffMS tackles de novo structure elucidation from mass spectra by generating molecular graphs that are conditioned on a spectrum embedding and constrained by a known chemical formula, using a transformer-based spectrum encoder and a discrete graph diffusion decoder. It introduces a formula-constrained diffusion framework and a two-stage pretraining strategy: the encoder is pretrained to predict fingerprints from spectra, the decoder is pretrained on millions of fingerprint–molecule pairs, and the combined model is then finetuned end-to-end. On NPLIB1 and MassSpecGym, DiffMS reports state-of-the-art Top-1/Top-10 accuracy and structural similarity, with ablations confirming the benefits of pretraining and formula inference. The approach enables scalable, MS-conditioned molecule generation with strong chemical validity.

Abstract

Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.
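
As an illustration of what "restricted by the heavy-atom composition of a known chemical formula" means in practice, the sketch below expands a formula string into the list of heavy atoms that would serve as graph nodes. The function name `heavy_atoms` and the simple regex parser are assumptions made for this example, not part of the DiffMS codebase; the parser handles only flat formulas without parentheses, charges, or isotopes.

```python
import re

def heavy_atoms(formula):
    """Expand a flat formula such as 'C6H13NO2' into a list of heavy-atom symbols.

    Hypothetical helper for illustration only: hydrogens are dropped because
    the decoder is described as being restricted by heavy atoms only.
    """
    atoms = []
    # Each match is (element symbol, optional count), e.g. ('C', '6') or ('N', '').
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            continue  # hydrogens are not treated as graph nodes here
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms

# Leucine and isoleucine share the formula C6H13NO2, so they share this node list;
# only the bonds between the nodes differ.
nodes = heavy_atoms("C6H13NO2")
print(len(nodes), nodes)
```

Fixing the node list this way leaves only the bond types between atoms to be generated, which is the part handled by the diffusion decoder.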

Paper Structure

This paper contains 25 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: De novo structure generation from LC-MS/MS is ambiguous when isobaric or isomeric compounds yield similar fragmentation spectra. In this case, the experimental spectra for leucine and isoleucine from the NIST database [nist_database] are essentially indistinguishable. This is one of many examples showing that identifying the exact structure is desirable but challenging.
  • Figure 2: DiffMS tackles de novo molecular generation from mass spectra. We embed mass spectrum features with a transformer encoder and assume the chemical formula is determined by off-the-shelf tools [goldman2023mist-cf; bocker2016fragmentation], so that the numbers and types of heavy atoms (i.e. nodes in the molecular graph) are constrained. The molecular structure is represented as an adjacency matrix with one-hot encoded bond types, which in this example are single (blue), double (yellow), and aromatic (red) bonds, plus no bond (white). The target molecular structure is generated starting from a randomly initialized adjacency matrix, which is denoised through a discrete diffusion process [vignac2023digress]. The trajectory used for training is created by randomly perturbing the true structure $t$ times.
  • Figure 3: Model architecture of DiffMS. A) The spectrum encoder first assigns chemical formulae to peaks in an experimental spectrum and then learns an embedding vector through a formula transformer. The encoder is pretrained to predict Morgan fingerprints [morgan1965generation] from spectra. B) The graph decoder generates the target adjacency matrix by discrete diffusion conditioned on the spectrum embedding and node (atom) features. The graph decoder is pretrained with pairs of structures and fingerprints from virtual chemical libraries. We scale up the decoder pretraining to exploit the virtually infinite number of available fingerprint-structure pairs relative to the small number of available spectrum-structure pairs. This mitigates the difficulty of fingerprint-to-molecule generation, which [le2020neuraldecipher] found to be non-trivial. C) DiffMS integrates the spectrum encoder and graph decoder to generate the structure annotation as a denoising process applied to a graph with randomly generated edges. It is finetuned end-to-end on labeled molecule-spectrum data.
  • Figure 4: Ground truth molecules (left column) and DiffMS predictions (right columns) on test samples from the MassSpecGym dataset [bushuiev2024massspecgymbenchmarkdiscoveryidentification]. Tanimoto similarity and MCES metrics are listed for each top-$k$ prediction. From top to bottom, the spectrum IDs are MassSpecGymID0205184, MassSpecGymID0052933, MassSpecGymID0382596, and MassSpecGymID0152454. The top two rows show cases where DiffMS successfully reconstructs the true molecule in the top-1 prediction. In the bottom two rows, DiffMS does not reconstruct the correct molecule. Additional examples can be found in the Appendix.
  • Figure 5: NPLIB1 top-$k$ accuracy for DiffMS pretrained on increasingly large fingerprint-to-molecule datasets. Additional metrics are available in the decoder ablation table in the Appendix.
  • ...and 6 more figures
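
The Figure 2 caption describes a training trajectory built by randomly perturbing the true adjacency matrix $t$ times. A minimal sketch of that idea follows, under loudly stated assumptions: each edge is resampled uniformly over the four bond classes with a fixed probability per step. The function names, the `flip_prob` parameter, and the uniform transition are all hypothetical choices for this example; the transition scheme actually used by DiffMS and [vignac2023digress] is not specified in this section.

```python
import random

# Bond classes named in the Figure 2 caption; the index order is an assumption.
BOND_TYPES = ["none", "single", "double", "aromatic"]

def noise_step(adj, flip_prob, rng):
    """Resample each atom pair's bond class with probability flip_prob.

    Only the upper triangle is sampled and then mirrored, so the matrix
    stays symmetric and the diagonal (no self-bonds) is left untouched.
    """
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < flip_prob:
                new_bond = rng.randrange(len(BOND_TYPES))
                out[i][j] = new_bond
                out[j][i] = new_bond
    return out

def noise_trajectory(adj, t, flip_prob=0.1, seed=0):
    """Return [G_0, G_1, ..., G_t], where G_0 is the true structure."""
    rng = random.Random(seed)
    trajectory = [adj]
    for _ in range(t):
        trajectory.append(noise_step(trajectory[-1], flip_prob, rng))
    return trajectory

def one_hot(adj):
    """Encode integer bond classes as one-hot vectors, one per atom pair."""
    k = len(BOND_TYPES)
    return [[[1 if c == b else 0 for c in range(k)] for b in row] for row in adj]

# Toy example: three heavy atoms in a chain (e.g. C-C-O), single bonds only.
true_adj = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
trajectory = noise_trajectory(true_adj, t=5, flip_prob=0.5, seed=1)
print(len(trajectory), "graphs; last one:", trajectory[-1])
```

Because the atoms are fixed by the formula, only the edge entries are perturbed here; generation then corresponds to running the learned reverse of this process from a random adjacency matrix.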