Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

Yiwen Zhang, Keyan Ding, Yihang Wu, Xiang Zhuang, Yi Yang, Qiang Zhang, Huajun Chen

TL;DR

MS/MS-to-molecule retrieval suffers from limited spectral library coverage and from cross-modal misalignment between spectra and chemical structures. GLMR addresses both with a two-stage approach. A pre-retrieval stage uses cross-modal contrastive alignment to select candidate molecules as context. A context-aware generative retrieval stage then produces a refined molecule conditioned on the input spectrum and these priors, and the candidates are re-ranked by their molecular similarity to it. The framework trains a ChemFormer-based molecular encoder and a spectrum encoder to learn aligned representations with an InfoNCE loss, then fuses the spectrum and the priors through cross-attention in a ChemFormer decoder to generate refined SMILES. The authors introduce MassRET-20k to test generalization beyond MassSpecGym and report that GLMR achieves an improvement of over 40% in top-1 accuracy along with strong transfer performance, highlighting the practicality of bridging the modality gap with generative retrieval for library-free compound identification in mass spectrometry.
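The contrastive alignment objective mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the symmetric (spectrum-to-molecule plus molecule-to-spectrum) form, the temperature value, and the function names `info_nce` and `_log_softmax_at` are assumptions made for illustration, and plain Python lists stand in for encoder outputs.

```python
import math

def _normalize(v):
    # L2-normalize a vector; assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _log_softmax_at(logits, target):
    # Numerically stable log-softmax evaluated at index `target`.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[target] - log_z

def info_nce(spec_embs, mol_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    spec_embs[i] and mol_embs[i] are assumed to describe the same compound;
    every other pairing in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    s = [_normalize(v) for v in spec_embs]
    m = [_normalize(v) for v in mol_embs]
    n = len(s)
    # Scaled cosine similarity between every spectrum and every molecule.
    sim = [[sum(a * b for a, b in zip(s[i], m[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Spectrum -> molecule direction: each row should peak on the diagonal.
    s2m = -sum(_log_softmax_at(sim[i], i) for i in range(n)) / n
    # Molecule -> spectrum direction: each column should peak on the diagonal.
    m2s = -sum(_log_softmax_at([sim[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (s2m + m2s)
```

With toy embeddings, correctly paired inputs give a loss near zero, while swapping the molecule order raises it sharply. This is the pressure that pulls matching spectra and molecules together in the shared space.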

Abstract

Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.

Paper Structure

This paper contains 38 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustration of the methods for MS-to-Molecule retrieval. (a) Spectral library matching method, where the input mass spectrum is compared against the reference MS of characterized compounds in a database. (b) Cross-modal representation alignment method, where both mass spectra and molecular structures are encoded into a potentially aligned latent space. (c) Our method, which builds upon cross-modal representation alignment, further incorporates a context-aware molecule generator for generative retrieval.
  • Figure 2: Overview of the proposed GLMR method. (a) Training process of modality alignment. We optimize a contrastive loss that encourages mutual alignment between the molecular and spectral modalities. (b) Inference process of pre-retrieval. We use the learned encoders to rank candidate molecules in the retrieval database. The top-$K$ molecules with the highest similarity scores are selected as the output of the pre-retrieval stage. (c) Training process of generative language models. We leverage a generative language model conditioned on both the input mass spectrum and the prior candidate molecules to produce refined molecular structures. (d) Inference process of generative retrieval. We use the generated molecule to re-rank the pre-retrieved molecules based on molecular similarity.
  • Figure 3: The modality gap (MG) distributions on MassSpecGym. A smaller MG indicates better alignment between MS/MS spectra and molecules.
  • Figure 4: Performance trends with varying $K$. Experiments are conducted on the weight-based retrieval library of MassSpecGym.
  • Figure A1: The types and proportions of ionized adducts in (a) MassRET-20k and (b) MassSpecGym.
  • ...and 1 more figure
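The two inference steps described in Figure 2 (b) and (d) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: cosine similarity is assumed for pre-retrieval ranking, Tanimoto similarity over fingerprint bit sets is assumed as the "molecular similarity" used for re-ranking, and the names `pre_retrieve`, `tanimoto`, and `rerank` are hypothetical. The generated molecule's fingerprint is taken as given, since the generator itself is outside the scope of the sketch.

```python
import math

def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pre_retrieve(spectrum_emb, molecule_embs, k):
    """Indices of the k database molecules closest to the spectrum embedding."""
    order = sorted(range(len(molecule_embs)),
                   key=lambda i: cosine(spectrum_emb, molecule_embs[i]),
                   reverse=True)
    return order[:k]

def tanimoto(fp_a, fp_b):
    # Tanimoto (Jaccard) similarity between two sets of "on" fingerprint bits.
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def rerank(generated_fp, candidate_fps):
    """Order candidate ids by similarity to the generated molecule.

    candidate_fps maps a candidate id to its set of fingerprint bits
    (toy fingerprints here; an assumption for illustration).
    """
    return sorted(candidate_fps,
                  key=lambda cid: tanimoto(generated_fp, candidate_fps[cid]),
                  reverse=True)
```

In this sketch the generated molecule never has to match a database entry exactly. It only needs to be closer to the correct candidate than to the others, which is what lets the re-ranking step correct mistakes made by the embedding-based pre-retrieval.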