One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

Neng Kai Nigel Neo, Lim Jing, Ngoui Yong Zhau Preston, Koh Xue Ting Serene, Bingquan Shen

TL;DR

This work tackles de novo molecule generation from mass spectra by rethinking the two-stage pipeline that encodes spectra into fingerprints and decodes those fingerprints into structures. The authors combine a mass-spectra encoder (MIST) with a thresholded fingerprint representation and a transformer-based fingerprint decoder (MolForge), trained on a large external dataset to improve generalization. They demonstrate a roughly tenfold improvement over prior state-of-the-art, achieving top-1 accuracy around $31\%$ and top-10 accuracy around $40\%$ on MassSpecGym, with gains enhanced by a prior-adjusted threshold ($t=0.172$) for selecting active fingerprint bits. The results indicate that decoder capacity benefits significantly from extensive training data and that the remaining bottleneck lies in accurate fingerprint prediction from mass spectra, establishing a strong baseline and guiding future work toward improving spectra-to-fingerprint inference and data scaling.

Abstract

A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging additional training data to enhance performance. We also threshold the probabilities of each fingerprint bit to focus on the presence of substructures. This results in a tenfold improvement over previous state-of-the-art methods: on MassSpecGym (Bushuiev et al., 2024), the correct molecular structure is generated from the mass spectrum in 31% of cases at top-1 and 40% at top-10. We position this as a strong baseline for future research in de novo molecule elucidation from mass spectra.
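
To make the thresholding step concrete, here is a minimal Python sketch. It assumes the encoder emits one probability per fingerprint bit and that a bit counts as active when its probability is at least the threshold. The function name, the inclusive comparison, and the example probabilities are illustrative assumptions, not details taken from the paper. The default of 0.172 is the prior-adjusted threshold quoted in the TL;DR.

```python
from typing import List, Sequence


def threshold_fingerprint(probs: Sequence[float], t: float = 0.172) -> List[int]:
    """Binarize per-bit probabilities into a 0/1 fingerprint.

    Assumption: a bit is set when its probability is >= t. Whether the
    original pipeline uses an inclusive or strict comparison is not stated.
    """
    return [1 if p >= t else 0 for p in probs]


# Hypothetical encoder output for a 5-bit fingerprint (made-up values).
example_probs = [0.05, 0.20, 0.90, 0.17, 0.60]
example_bits = threshold_fingerprint(example_probs)
print(example_bits)  # [0, 1, 1, 0, 1]
```

A threshold well below 0.5 turns on more bits than plain rounding would, trading extra false positive bits for fewer missed substructures.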

Paper Structure

This paper contains 17 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Our proposed pipeline of using MIST as a mass spectrum encoder, thresholding the fingerprint, and using MolForge as a fingerprint decoder for the de novo molecule generation problem.
  • Figure 2: False positive and negative bits across different threshold values when applied to MIST fingerprints.
  • Figure 3: Tanimoto similarity of the thresholded MIST fingerprint and of the top-1 structure from MolForge, both compared to the ground truth fingerprint. Orange points indicate an exact structural match, and points above (below) the green parity line show predicted structures that are better (worse) than the thresholded fingerprint.
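
The quantities plotted in Figures 2 and 3 can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes fingerprints are equal-length 0/1 sequences, the function names are hypothetical, and returning 1.0 for two all-zero fingerprints is a convention chosen here, not something the paper specifies.

```python
from typing import Sequence, Tuple


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by bits on in either."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two empty fingerprints are treated as identical.
    return both / either if either else 1.0


def bit_errors(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int]:
    """Count (false positive, false negative) bits of pred against truth."""
    if len(pred) != len(truth):
        raise ValueError("fingerprints must have the same length")
    false_pos = sum(1 for p, t in zip(pred, truth) if p and not t)
    false_neg = sum(1 for p, t in zip(pred, truth) if not p and t)
    return false_pos, false_neg


# Made-up 5-bit fingerprints for illustration.
truth_bits = [1, 1, 0, 0, 1]
pred_bits = [1, 0, 1, 0, 1]
print(tanimoto(pred_bits, truth_bits))    # 0.5
print(bit_errors(pred_bits, truth_bits))  # (1, 1)
```

Sweeping the threshold and recording `bit_errors` at each value gives the kind of curve described for Figure 2, while `tanimoto` against the ground truth fingerprint gives each axis of Figure 3.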