Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry

Yonghan Yu; Ming Li

TL;DR

DeepSearch is presented as the first deep learning-based end-to-end database search method for tandem MS. It replaces ion-to-ion scoring with cross-modal cosine similarity between spectrum and peptide embeddings. Trained with a contrastive learning framework on MassIVE v2 data, the transformer-based encoder-decoder jointly learns peptide inference and PSM re-ranking. The method reduces scoring bias and is robust across species, achieving competitive or superior PSM identification rates without relying on statistical estimation. It also supports zero-shot PTM profiling by PTM-mass shifting, showing substantial agreement with established tools on phosphorylation-enriched data. These results suggest a scalable, data-driven alternative to traditional heuristic database search engines in proteomics, with practical implications for less biased peptide identification and PTM analysis.
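The cosine-similarity scoring described above can be illustrated with a minimal sketch. It is not the DeepSearch implementation: the embeddings are toy 3-dimensional vectors rather than encoder outputs, and the names `l2_normalize`, `cosine_scores`, and `peptide_db` are hypothetical. The sketch only shows that, once both sides are scaled to unit length, scoring one spectrum against every candidate peptide reduces to dot products (one row of a matrix multiplication).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(spectrum_embedding, peptide_embeddings):
    """Return the cosine similarity between one spectrum and every peptide embedding."""
    s = l2_normalize(spectrum_embedding)
    return [
        sum(a * b for a, b in zip(s, l2_normalize(p)))
        for p in peptide_embeddings
    ]

# Toy embeddings (hypothetical values, not model output).
peptide_db = {
    "PEPTIDEK": [1.0, 0.0, 0.0],
    "SAMPLER": [0.0, 1.0, 0.0],
    "ELVISK": [1.0, 1.0, 0.0],
}
spectrum = [2.0, 0.0, 0.0]

scores = cosine_scores(spectrum, list(peptide_db.values()))

# Rank candidate peptides by descending similarity; the top hit is the PSM.
ranked = sorted(zip(peptide_db, scores), key=lambda pair: pair[1], reverse=True)
best_peptide, best_score = ranked[0]
```

Because cosine similarity ignores vector magnitude, the spectrum `[2.0, 0.0, 0.0]` matches `PEPTIDEK` exactly even though their lengths differ. The sketch assumes no embedding is the zero vector.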

Abstract

Peptide identification in mass spectrometry-based proteomics is crucial for understanding protein function and dynamics. Traditional database search methods, though widely used, rely on heuristic scoring functions, and statistical estimation must be introduced to achieve a higher identification rate. Here, we introduce DeepSearch, the first deep learning-based end-to-end database search method for tandem mass spectrometry. DeepSearch leverages a modified transformer-based encoder-decoder architecture under a contrastive learning framework. Unlike conventional methods that rely on ion-to-ion matching, DeepSearch adopts a data-driven approach to scoring peptide-spectrum matches. DeepSearch is also the first deep learning-based method that can profile variable post-translational modifications in a zero-shot manner. We show that DeepSearch's scoring scheme exhibits less bias and does not require any statistical estimation. We validated DeepSearch's accuracy and robustness across various datasets, including those from species with diverse protein compositions and a modification-enriched dataset. DeepSearch sheds new light on database search methods in tandem mass spectrometry.
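To make the contrastive learning framework more concrete, here is a minimal sketch of an in-batch contrastive objective. The exact loss used by DeepSearch is not given in this summary, so the sketch assumes a symmetric cross-entropy over a batch similarity matrix (in the style of CLIP); the function names and the `temperature` value are illustrative assumptions. Each spectrum's matched peptide sits on the diagonal, and every other peptide in the batch serves as a negative, so no negative pairs need to be handcrafted.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def in_batch_contrastive_loss(similarity, temperature=0.1):
    """Symmetric cross-entropy over a square similarity matrix.

    similarity[i][j] is the cosine similarity between spectrum i and
    peptide j; the matched pair for row i is on the diagonal (j == i).
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    logits_t = [list(col) for col in zip(*logits)]  # peptide-to-spectrum view
    loss_s2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_p2s = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_s2p + loss_p2s)

# Toy batches of three spectrum-peptide pairs (hypothetical similarities).
well_aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
misaligned = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.1]]

loss_good = in_batch_contrastive_loss(well_aligned)
loss_bad = in_batch_contrastive_loss(misaligned)
```

The loss is lower when matched pairs dominate their row and column, which is what pushes spectrum and peptide embeddings of the same PSM together during training.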

Paper Structure (3 sections, 6 figures, 1 table)

This paper contains 3 sections, 6 figures, 1 table.

Main
Results
Discussion

Figures (6)

Figure 1: A. Conventional database search engines compare experimental MS/MS spectra with theoretical spectra generated from an in silico digested peptide database. B. DeepSearch performs in silico protein digestion and computes an embedding database for the peptides. Each spectrum is encoded into a spectrum embedding, and the cosine similarities between the spectrum embedding and the peptide embeddings are computed with a matrix multiplication. C. DeepSearch uses an in-batch contrastive learning framework, with no handcrafted negative pairs. D. DeepSearch adopts a transformer-based encoder-decoder architecture coupled with a contrastive learning framework. E. DeepSearch performs PSM re-ranking with the multimodal peptide decoder, using the Phred quality score calculated from the softmax probabilities of the amino acids in the peptide sequence. F. DeepSearch performs zero-shot PTM profiling by shifting the theoretical spectrum by the corresponding PTM mass.
Figure 2: A. Score distributions reported by DeepSearch, MSFragger, MSGF+, and MaxQuant. Peptides are grouped into 5 categories based on their length. B. Score distribution reported by DeepSearch, by peptide length.
Figure 3: PSM FDR is controlled using the raw score, the search engine-reported expect value, or the estimated PEP for A. the Arabidopsis thaliana dataset. B. the HEK293 dataset. C. the HEK293 dataset with methionine oxidation as a variable PTM. D. the Caenorhabditis elegans dataset. E. the Escherichia coli dataset. F. the HeLa dataset with methionine oxidation and phosphorylation of serine, threonine, and tyrosine as variable PTMs. Different colors represent different search engines, and the y-axis shows the number of PSMs at an FDR of 1%. MaxQuant failed on the HeLa dataset (F).
Figure 4: A. Score distribution of DeepSearch PSMs under the target-decoy strategy. B. Spectra identification rate against FDR, controlled with reported scores, for DeepSearch and other search engines. C. Spectra identification rate against FDR, controlled with reported expect values, for DeepSearch and other search engines. D. Peptide identifications after 1% PSM-level FDR control with reported scores for DeepSearch and other search engines. E. Peptide identifications after 1% PSM-level FDR control with reported expect values for DeepSearch and other search engines. F. Peptides identified by DeepSearch and MSGF+, divided by estimated confidence level. The FDR control is performed with scores for DeepSearch and expect values for MSGF+. Group-specific FDRs are calculated using the decoy sequences in each group.
Figure 5: A. Score distribution of DeepSearch PSMs under the target-decoy strategy. B. Spectra identification rate against FDR, controlled with reported scores, for DeepSearch and other search engines. C. Spectra identification rate against FDR, controlled with reported expect values, for DeepSearch and other search engines. D. Peptide identifications after 1% PSM-level FDR filtering with reported scores for DeepSearch and other search engines. E. Peptide identifications after 1% PSM-level FDR filtering with reported expect values for DeepSearch and other search engines.
...and 1 more figure
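The FDR-controlled PSM counts in the figure captions above rely on the target-decoy strategy. The following is a minimal sketch of how such a count could be computed from a ranked list of PSMs. It assumes the common estimator FDR = decoys / targets above a score cut-off; the summary does not state which estimator the paper uses, and the function name `count_targets_at_fdr` and the toy data are hypothetical.

```python
def count_targets_at_fdr(psms, fdr_threshold=0.01):
    """Count target PSMs accepted at a given FDR.

    psms is an iterable of (score, is_decoy) pairs, where a higher score
    means a better match. PSMs are ranked by score, and the largest prefix
    whose estimated FDR (decoys / targets) stays within the threshold is kept.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest cut-off that still satisfies the FDR threshold.
        if targets > 0 and decoys / targets <= fdr_threshold:
            accepted = targets
    return accepted

# Toy PSM list (hypothetical scores): 200 high-scoring targets, one decoy,
# 50 more targets, then 10 low-scoring decoys.
psms = (
    [(1.0 - i * 0.001, False) for i in range(200)]
    + [(0.79, True)]
    + [(0.78 - i * 0.001, False) for i in range(50)]
    + [(0.5 - i * 0.01, True) for i in range(10)]
)

n_at_1_percent = count_targets_at_fdr(psms, fdr_threshold=0.01)
n_at_0_1_percent = count_targets_at_fdr(psms, fdr_threshold=0.001)
```

The same routine applies whether PSMs are ranked by raw score, by expect value (after negating it so that higher is better), or by an estimated PEP. Only the ranking key changes.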
