MADGEN: Mass-Spec attends to De Novo Molecular generation

Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun

TL;DR

MADGEN addresses the annotation of MS/MS spectra with a scaffold-based two-stage pipeline: first retrieve a core scaffold from the spectrum, then generate the full molecule conditioned on both the scaffold and the spectrum. Scaffold retrieval uses a contrastively trained predictive retriever, with an oracle retriever serving as an upper bound on performance. The second stage uses a Markov-bridge graph generator with classifier-free spectrum guidance to produce molecular graphs that align with the spectral evidence. On NIST23, CANOPUS, and MassSpecGym, MADGEN achieves strong results when the true scaffold is supplied (oracle retriever); with the predictive retriever, scaffold prediction is the bottleneck, and richer data and larger scaffolds are potential avenues for improvement. Overall, MADGEN offers a principled, interpretable path toward de novo MS/MS annotation by coupling backbone scaffolds with spectrum-informed generation, with practical relevance for metabolomics and related fields.
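
The classifier-free spectrum guidance mentioned above can be illustrated with a minimal sketch. It assumes the common convention of extrapolating from unconditional toward conditional logits with a guidance weight `w`; the function names, the weight convention, and the three-way bond-type logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(cond, uncond, w):
    """Blend spectrum-conditioned and unconditional logits.

    Assumed convention: (1 + w) * cond - w * uncond, so w = 0 recovers
    the conditional prediction and larger w pushes further toward it.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


# Hypothetical logits over three bond types for one candidate edge.
cond_logits = [2.0, 0.5, 0.1]    # model output given the spectrum
uncond_logits = [1.0, 1.0, 1.0]  # model output with the spectrum dropped

plain = softmax(guided_logits(cond_logits, uncond_logits, 0.0))
sharp = softmax(guided_logits(cond_logits, uncond_logits, 1.0))
print(plain)
print(sharp)
```

With a positive weight, the option the spectrum favors relative to the unconditional prediction receives more probability mass, which is the intended effect of guidance.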

Abstract

The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation starting from the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we use the MS/MS spectrum to guide an attention-based generative model that produces the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym) with both a predictive scaffold retriever and an oracle retriever. We demonstrate that using attention to integrate spectral information throughout the generation process achieves strong results with the oracle retriever.
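
The first stage, scaffold retrieval as a ranking problem trained with contrastive learning, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the two-dimensional embeddings, the scaffold names, the cosine similarity, the InfoNCE-style loss, and the temperature value are all hypothetical stand-ins for whatever encoders and objective the paper uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Return candidate names sorted from most to least similar."""
    scores = {name: cosine(spectrum_emb, emb)
              for name, emb in scaffold_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def info_nce(spectrum_emb, scaffold_embs, positive, temperature=0.1):
    """InfoNCE-style loss: negative log-probability of the true scaffold."""
    logits = {name: cosine(spectrum_emb, emb) / temperature
              for name, emb in scaffold_embs.items()}
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return log_z - logits[positive]


# Hypothetical embeddings; real ones would come from learned encoders.
spectrum = [1.0, 0.0]
candidates = {
    "scaffold_a": [0.9, 0.1],
    "scaffold_b": [0.0, 1.0],
    "scaffold_c": [-1.0, 0.0],
}

ranking = rank_scaffolds(spectrum, candidates)
loss_true = info_nce(spectrum, candidates, positive="scaffold_a")
loss_wrong = info_nce(spectrum, candidates, positive="scaffold_c")
print(ranking, loss_true, loss_wrong)
```

Minimizing such a loss pulls a spectrum's embedding toward its true scaffold and away from the other candidates, so that at inference the top-ranked candidate can be passed to the generation stage.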

Paper Structure (36 sections, 13 equations, 3 figures, 3 tables)

This paper contains 36 sections, 13 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: MADGEN overview and example. (a) Overview of MADGEN. The mass spectra are used to rank scaffold candidates through contrastive learning. The top-ranked scaffold, with blue edges fixed, serves as the foundation for de novo molecule generation, guided by the spectra at each generation step. (b) Examples of the molecular generation process over time steps for Kenalog from the CANOPUS dataset (upper) and 2,6-Dinitro-4-(4-nitrophenyl)phenol from the NIST23 dataset (lower). The scaffolds remain fixed, while additional edges are introduced at each step to connect free atoms to the scaffolds. The complete molecules are shown at step 40.
  • Figure 2: Overview of the MADGEN model framework. The input consists of m/z peaks and intensities $(m, I)$, which are passed through an MLP for embedding. These embeddings are processed through self-attention and combined with the molecular graph's node and edge embeddings via cross-attention. The node and edge embeddings are updated iteratively using edge-aware message-passing neural network (MPNN) and fully-connected graph neural network (FC-GNN) layers. The final molecular structure is sampled after the last time step via a logit layer, aligning with the mass spectral data.
  • Figure 3: Accuracy vs. number of free atoms. The more free atoms MADGEN must connect to the scaffold, the more complex the generative trajectory becomes, leading to lower predictive accuracy.
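
The cross-attention step described in the Figure 2 caption, where graph embeddings attend to spectrum peak embeddings, can be sketched in a few lines. This is a simplified single-head scaled dot-product attention without learned projections; the function name, the omission of projection matrices, and the toy vectors are assumptions made for illustration, not details from the paper.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: node embeddings (one vector per atom)
    keys, values: peak embeddings (one vector per spectrum peak)
    Returns one spectrum-informed vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Hypothetical embeddings: two atoms attending over two spectrum peaks.
node_embs = [[1.0, 0.0], [0.0, 1.0]]
peak_keys = [[0.5, 0.5], [0.5, 0.5]]   # identical keys give uniform weights
peak_vals = [[1.0, 0.0], [3.0, 2.0]]

attended = cross_attention(node_embs, peak_keys, peak_vals)
print(attended)
```

In a full model the queries, keys, and values would each pass through learned linear projections, and the attended vectors would feed into the MPNN and FC-GNN updates mentioned in the caption.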