LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation

Gabriel Asher, Devesh Shah, Amy A. Caudy, Luke Ferro, Lea Amar, Ana S. H. Costa, Thomas Patton, Niall O'Connor, Jennifer M. Campbell, Jack Geremia

TL;DR

LSM-MS2 introduces a transformer-based foundation model trained on millions of MS/MS spectra to create a semantic space for chemical interpretation. It achieves state-of-the-art spectral identification, notably improving isomer discrimination and maintaining robustness at low concentrations across diverse benchmarks. Beyond retrieval, the learned embeddings enable direct biological interpretation, differentiating disease states and predicting clinical outcomes from limited downstream data. These results suggest substantial practical impact for accelerating metabolomics discovery and translational research with minimal task-specific data.

Abstract

A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.

Paper Structure

This paper contains 33 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Comparison of model performance on the MWX-Isomers dataset. LSM-MS2 significantly outperforms previous methods on this dataset, both cumulatively per analyte and per isomer group. (a) Overall per-analyte identification accuracy across all 61 biological isomers. When several models tie for the best top-1 accuracy on an analyte, each tied model receives one point, so the cumulative total exceeds 61. (b) Per-group distribution of top-1 accuracies across all 22 isomeric groups in the dataset.
  • Figure 2: Top-1 identification accuracy across three biologically important isomer groups in the MWX-Isomers dataset. Balanced performance across all isomer pairs is critical for true isomeric discrimination, a task in which LSM-MS2 outperforms previous methods.
  • Figure 3: Global identification performance of true positives (left) and spurious hits (right) comparing Cosine Similarity and LSM-MS2 across different score thresholds on the NIST Dilution Series.
  • Figure 4: True positive identifications (left) and precision (right) for both models at varying dilution factors. Metrics are evaluated at the score threshold that maximizes F1 for each scoring method-dilution pair.
  • Figure 5: Unsupervised UMAP projections of study samples colored by mortality cause. Precursor baseline embeddings (left) show limited separation between CPZ/PER and OLA/CLO, whereas LSM-MS2 embeddings (right) reveal clear distinction across all fatality types.
  • ...and 11 more figures
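
Two evaluation conventions mentioned in the captions above lend themselves to a short illustration: the tie rule in Figure 1(a), where every model tied for the best top-1 accuracy on an analyte receives a point, and the Figure 4 convention of reporting metrics at the score threshold that maximizes F1. The Python sketch below is only one plausible reading of those captions, not the authors' evaluation code. The function names, the input layouts, the `score >= threshold` acceptance rule, and all toy numbers are assumptions made for illustration.

```python
from collections import Counter


def tie_aware_points(per_analyte_accuracy):
    """Award one point per analyte to every model that reaches the best accuracy.

    per_analyte_accuracy: {analyte: {model: top1_accuracy}} (hypothetical layout).
    Ties are detected with exact equality, which is fine for this toy data;
    real accuracies may call for a tolerance.
    """
    points = Counter()
    for scores in per_analyte_accuracy.values():
        best = max(scores.values())
        for model, accuracy in scores.items():
            if accuracy == best:
                points[model] += 1
    return points


def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1, accepting a hit when score >= threshold.

    scores: similarity scores of candidate identifications.
    labels: 1 if the candidate is a true identification, else 0.
    """
    total_positives = sum(labels)
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = total_positives - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Made-up toy values, not results from the paper.
toy_accuracy = {
    "analyte_1": {"model_a": 0.9, "model_b": 0.7},
    "analyte_2": {"model_a": 0.8, "model_b": 0.8},  # tie: both models score a point
    "analyte_3": {"model_a": 0.5, "model_b": 0.6},
}
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0]

if __name__ == "__main__":
    print(dict(tie_aware_points(toy_accuracy)))
    print(best_f1_threshold(toy_scores, toy_labels))
```

In the toy data, three analytes produce four points in total because of the tie on `analyte_2`, which mirrors how the cumulative total in Figure 1(a) can exceed the number of analytes.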