MADGEN: Mass-Spec Attends to De Novo Molecular Generation
Yinkai Wang, Xiaohui Chen, Liping Liu, Soha Hassoun
TL;DR
MADGEN addresses the annotation of MS/MS spectra with a scaffold-based, two-stage pipeline: first retrieve a core scaffold from the spectrum, then generate the full molecule conditioned on both the scaffold and the spectrum. Scaffold retrieval is studied in two settings: a predictive retriever trained with contrastive learning, and an oracle retriever that supplies the correct scaffold and thereby bounds the performance of the generation stage. The second stage uses a Markov-bridge graph generator with classifier-free spectrum guidance to produce molecular graphs that agree with the spectral evidence. On NIST23, CANOPUS, and MassSpecGym, the approach shows meaningful gains when the scaffold is known (oracle setting), which points to scaffold prediction as the bottleneck; richer data and larger scaffolds are possible routes to improvement. Overall, MADGEN offers a principled, interpretable path toward de novo MS/MS annotation by coupling backbone scaffolds with spectrum-informed generation, with practical relevance for metabolomics and related fields.
Abstract
The annotation of MS/MS spectra (assigning structural chemical identities) remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval, followed by spectra-conditioned molecular generation that starts from the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we use the MS/MS spectrum to guide an attention-based generative model that produces the final molecule. Our approach constrains the search space of molecular generation, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym), using both a predictive scaffold retriever and an oracle retriever. With the oracle retriever, MADGEN achieves strong results, demonstrating the effectiveness of using attention to integrate spectral information throughout the generation process.
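To make the first stage concrete, the following is a minimal sketch of contrastive alignment and ranking, not the authors' implementation. It assumes that spectra and scaffolds have already been encoded into fixed-size vectors by some encoders (not shown), and it uses a one-directional InfoNCE-style loss with cosine similarity as a stand-in for whatever contrastive objective the paper actually uses. The function names `info_nce_loss` and `rank_scaffolds`, the temperature value, and the random toy embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce_loss(spec_emb, scaf_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of (spectrum, scaffold) pairs.

    Row i of spec_emb is assumed to be paired with row i of scaf_emb;
    every other scaffold in the batch acts as a negative.
    """
    s = l2_normalize(spec_emb)
    c = l2_normalize(scaf_emb)
    logits = s @ c.T / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability of the matching scaffold.
    return float(-np.mean(np.diag(log_probs)))


def rank_scaffolds(query_emb, candidate_embs):
    """Rank candidate scaffolds for one spectrum by cosine similarity."""
    q = l2_normalize(query_emb)
    c = l2_normalize(candidate_embs)
    scores = c @ q
    order = np.argsort(-scores)  # best match first
    return order, scores[order]


# Toy data: each "spectrum" embedding is a noisy copy of its scaffold embedding.
rng = np.random.default_rng(0)
scaffolds = rng.normal(size=(5, 16))
spectra = scaffolds + 0.05 * rng.normal(size=(5, 16))

loss_aligned = info_nce_loss(spectra, scaffolds)
loss_shuffled = info_nce_loss(spectra, scaffolds[::-1])  # mostly wrong pairings
order, scores = rank_scaffolds(spectra[2], scaffolds)
```

In this toy setup the loss is lower for correctly paired embeddings than for mismatched ones, and the top-ranked candidate for `spectra[2]` is its own scaffold. At inference time only `rank_scaffolds` would be needed; the loss matters during training of the two encoders.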
