FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra

Jianan Nie, Peng Gao

Abstract

Mass spectrometry (MS) is a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion models, have made progress. However, diffusion-based generation for this task remains computationally demanding. Meanwhile, discrete flow matching, which has shown strong performance for graph generation, has not yet been explored for spectrum-conditioned structure elucidation. In this work, we introduce FlowMS, the first discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. It achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark, including 9.15% top-1 accuracy (a 9.7% relative improvement over DiffMS) and a top-10 MCES of 7.96 (a 4.2% improvement over MS-BART). Visualizations of the generated molecules further show that FlowMS produces structurally plausible candidates closely resembling ground truth structures. These results establish discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery.

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Overview of our discrete flow matching framework for mass spectrum-guided molecular generation. Given a tandem mass spectrum and molecular formula, the spectrum encoder produces a conditioning fingerprint. The discrete flow matching generator produces candidate molecular structures, which are subsequently ranked by spectral frequency.
  • Figure 2: Generated molecules on representative NPLIB1 samples, with the ground truth structure shown in the left column and the FlowMS predictions in the right columns. Tanimoto similarity scores and Maximum Common Edge Substructure (MCES) values are indicated below each prediction.
  • Figure 3: Positive test examples on the NPLIB1 dataset (Dührkop et al., 2021), where FlowMS correctly identifies the target molecule as its top-1 prediction. Ground truth molecules are shown in the left column and FlowMS predictions in the right columns.
  • Figure 4: Negative test sample from the NPLIB1 dataset (Dührkop et al., 2021), where FlowMS does not recover the exact ground truth structure as its top-1 prediction. Ground truth is shown in the left column and FlowMS predictions in the right columns.
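
The captions for Figure 2 report a Tanimoto similarity score for each prediction. As a rough illustration of what that score measures, the sketch below computes the Tanimoto (Jaccard) coefficient between two fingerprints represented as sets of "on" bit indices. The set representation, the function name `tanimoto`, and the example bit indices are assumptions made for illustration; the fingerprint type and tooling used by the paper are not stated in this summary.

```python
from __future__ import annotations


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(bits_a & bits_b) / union


# Hypothetical on-bit indices for a ground truth and a predicted molecule.
truth = {1, 5, 9, 12}
pred = {1, 5, 9, 20}
print(tanimoto(truth, pred))  # 3 shared bits out of 5 distinct bits = 0.6
```

A score of 1.0 means the two fingerprints set exactly the same bits, which is necessary but not sufficient for the structures to be identical; the MCES value reported alongside it is a separate, graph-based distance and is not computed here.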