MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J. J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal

TL;DR

MassSpecGym is the first large-scale public benchmark for discovering and identifying molecules from MS/MS spectra. It consolidates the largest labeled MS/MS dataset to date, enforces a generalization-demanding MCES-based data split, and defines three annotation tasks: de novo molecule generation, molecule retrieval, and spectrum simulation, each with tailored baselines and new evaluation metrics. The framework standardizes data processing (normalization of spectra, instrument types, and collision energies (CE); PubChem-style SMILES standardization) and aggregates auxiliary unlabeled datasets to support learning from large unlabeled corpora. By fixing the data, the split, and the metrics, the benchmark makes methods directly comparable and gives ML approaches to MS/MS annotation a common target for molecule discovery in biology and environmental chemistry.

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Paper Structure

This paper contains 59 sections, 6 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: MassSpecGym enables a standardized and user-friendly evaluation of machine learning methods for MS/MS annotation via an easily extendable modular interface. The dataset can be loaded, preprocessed, split, and utilized for training, evaluation, and metric logging through the prepared codebase. To develop and evaluate a new model, a user only needs to implement a forward pass with custom prediction logic. Colored blocks represent classes in our codebase (https://github.com/pluskal-lab/MassSpecGym). Arrows with empty heads represent subclass inheritance, while arrows with bold heads conceptually show the flow from the dataset to the evaluation metrics.
  • Figure 2: The same measured molecule may result in different MS/MS spectra under different mass spectrometry measurement conditions. a, The distribution of the number of spectra corresponding to the same molecule in the MassSpecGym dataset. b, Example of a molecule annotating 321 spectra in the dataset. c, Examples of six spectra annotated with the molecule shown in b: different instrument types and collision energies lead to different spectra. Higher collision energies typically lead to richer fragmentation of a molecule, resulting in a higher number of peaks in the spectrum.
  • Figure 3: The MCES-based metadata-stratified data split results in a balanced composition of metadata across data folds. a, Example of an MCES cluster of size six. b, Number of spectra in each fold in the dataset with respect to different metadata properties. c, Same as b, but for the subset of the dataset with no missing metadata. This subset is used for the spectrum simulation challenge.
  • Figure 4: TODO
  • Figure 6: MCES-based metadata-stratified data splitting results in a balanced distribution of chemical classes across data folds. The figure presents a histogram of the 50 most common chemical classes found in MassSpecGym, as assigned by ClassyFire (Djoumbou Feunang et al., 2016), with a separate bar for all less common classes, labeled as "Other (254 classes)". The box with an arrow pointing to this bar shows the number of classes uniquely present in individual folds, along with the number of underlying molecules.
  • ...and 4 more figures
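
The Figure 3 caption describes a split built from MCES clusters. The property that makes such a split generalization-demanding is that every member of a cluster lands in the same fold, so no test molecule has a near-identical structure in the training set. The following is a minimal sketch of that idea only, not the MassSpecGym implementation: it assumes cluster labels have already been computed elsewhere, it ignores the metadata stratification mentioned in the caption, and the function name `split_by_cluster`, the fold fractions, and the toy cluster labels are all hypothetical.

```python
from collections import defaultdict


def split_by_cluster(cluster_ids, fractions):
    """Assign whole clusters to folds (illustrative helper, not from the codebase).

    cluster_ids: one precomputed cluster label per spectrum (assumed input).
    fractions:   mapping of fold name to target share of spectra.
    Returns a list with one fold name per spectrum.
    """
    # Group spectrum indices by their cluster label.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    total = len(cluster_ids)
    targets = {fold: share * total for fold, share in fractions.items()}
    counts = {fold: 0 for fold in fractions}
    assignment = [None] * total

    # Place the largest clusters first; break ties by label for determinism.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        # Greedy choice: the fold that is furthest below its target size.
        fold = max(fractions, key=lambda f: targets[f] - counts[f])
        for idx in members[cid]:
            assignment[idx] = fold
        counts[fold] += len(members[cid])
    return assignment


# Toy example: ten spectra in six clusters (labels are made up).
cluster_ids = ["c1", "c1", "c1", "c2", "c2", "c3", "c4", "c4", "c5", "c6"]
folds = split_by_cluster(cluster_ids, {"train": 0.6, "val": 0.2, "test": 0.2})

for cid, fold in zip(cluster_ids, folds):
    print(cid, fold)
```

Because assignment happens per cluster rather than per spectrum, fold sizes only approximate the requested fractions; this is the usual trade-off of any group-wise split.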