De novo molecular structure elucidation from mass spectra via flow matching

Ghaith Mqawass, Tuan Le, Fabian Theis, Djork-Arné Clevert

TL;DR

MSFlow is a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules, accurately translating up to 45 percent of molecular mass spectra into their corresponding molecular representations.

Abstract

Mass spectrometry is a powerful and widely used tool for identifying molecular structures due to its sensitivity and ability to profile complex samples. However, translating spectra into full molecular structures is a difficult, under-defined inverse problem. Overcoming this problem is crucial for enabling biological insight, discovering new metabolites, and advancing chemical research across multiple fields. To this end, we develop MSFlow, a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules. In the first stage, we adopt a formula-restricted transformer model to encode mass spectra into a continuous and chemically informative embedding space. In the second stage, we train a flow-matching decoder to reconstruct molecules from the latent embeddings of mass spectra. We present ablation studies that demonstrate the importance of using information-preserving molecular descriptors for encoding mass spectra and that motivate the use of our discrete flow-based decoder. Our rigorous evaluation demonstrates that MSFlow can accurately translate up to 45 percent of molecular mass spectra into their corresponding molecular representations, an improvement of up to fourteen-fold over the current state of the art. A trained version of MSFlow is made publicly available on GitHub for non-commercial users.

Paper Structure (10 sections, 4 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Given a mass spectrum of a natural product, like the spectrum of caffeine, the task is to identify the corresponding structure of that product. This task is referred to as the "inverse problem" and can be solved by de novo generation.
  • Figure 2: Overview of our method. A) The input mass spectrum is encoded using MIST into an intermediate representation $Y$. In our work, we chose CDDD as the intermediate representation. B) The architecture of our flow decoder. It has a BERT-like architecture with bidirectional attention. Instead of the standard LayerNorm, we use an adaptive LayerNorm. The model samples from a uniform distribution over tokens and uses the condition $Y$, injected through its adaptive LayerNorm, to guide the generation of structures toward reconstructing the true molecule. C) Overview of our two-stage approach. First, the mass spectrum is encoded into a latent representation. Then, our conditional decoder takes the latent representation as an input condition to generate the corresponding molecule.
  • Figure 3: Molecular property distributions across the CANOPUS (CPS) and MassSpecGym (MSG) datasets. (A) Atom count distributions showing minimal shift in CANOPUS ($\Delta\mu=-0.1$ atoms) versus substantial shift in MassSpecGym ($\Delta\mu=+7.2$ atoms). (B) Rotatable bond distributions indicating slightly increased flexibility in the CANOPUS and MassSpecGym test splits ($\Delta\mu=+0.4$ bonds and $\Delta\mu=+0.3$ bonds, respectively). Solid lines: training data; dashed lines: test data.
  • Figure 4: Panels A, B, and C show top-1 accuracy, top-1 Tanimoto similarity, and top-1 MCES distance, respectively, as functions of molecular size (number of atoms) for the MassSpecGym benchmark dataset. Higher values indicate better performance for accuracy and Tanimoto similarity, while lower values are better for MCES distance. Panel D displays the reconstruction accuracy as a function of molecular flexibility (number of rotatable bonds). Given that reconstruction accuracy degrades with increasing molecular size, we restricted the analysis to molecules containing 20--25 atoms. The legend in Panel A applies to all other panels.
  • Figure 5: Predictions of our model on random MassSpecGym test samples. Left to right, MSFlow succeeds in reconstructing the first two ground truth molecules in the top-1 and top-10 predictions, respectively, but fails on the last three examples.
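
The caption of Figure 2 describes conditioning the decoder on the spectrum embedding $Y$ through an adaptive LayerNorm. The following is a minimal pure-Python sketch of one common formulation, in which a linear projection of the condition predicts a per-feature scale and shift that are applied after normalization. The paper's exact parameterization is not given in this section, so the class name `AdaptiveLayerNorm`, the `1 + scale` form, the dimensions, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, y):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, y)) + b
            for row, b in zip(weights, bias)]


class AdaptiveLayerNorm:
    """LayerNorm whose scale and shift are predicted from a condition vector.

    Hypothetical illustration, not the authors' implementation.
    """

    def __init__(self, hidden_dim, cond_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Assumed projection: condition -> (scale, shift), 2 * hidden_dim outputs.
        self.weights = [[rng.gauss(0.0, 0.02) for _ in range(cond_dim)]
                        for _ in range(2 * hidden_dim)]
        self.bias = [0.0] * (2 * hidden_dim)

    def __call__(self, x, y):
        params = linear(self.weights, self.bias, y)
        scale = params[:self.hidden_dim]
        shift = params[self.hidden_dim:]
        normed = layer_norm(x)
        # Modulate the normalized features with the condition-dependent affine.
        return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


# Usage: the same token representation under two different conditions.
ada_ln = AdaptiveLayerNorm(hidden_dim=4, cond_dim=3)
token = [0.5, -1.0, 2.0, 0.0]
out_a = ada_ln(token, [1.0, 0.0, 0.0])
out_b = ada_ln(token, [0.0, 0.0, 5.0])
```

In this sketch, a zero condition vector makes the predicted scale and shift vanish, so the layer reduces to a plain LayerNorm without learned affine parameters, while different conditions modulate the same token representation differently.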