Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry
Yonghan Yu, Ming Li
TL;DR
DeepSearch is presented as the first deep learning-based end-to-end database search method for tandem mass spectrometry (MS/MS). It replaces ion-to-ion scoring with cross-modal cosine similarity between spectrum and peptide embeddings. Trained under a contrastive learning framework on MassIVE v2 data, its transformer-based encoder-decoder jointly learns peptide inference and peptide-spectrum match (PSM) re-ranking, and it supports zero-shot post-translational modification (PTM) profiling by PTM-mass shifting. The method reduces scoring bias, is robust across species, and achieves competitive or superior PSM identification rates without relying on statistical estimation. On phosphorylation-enriched data, its zero-shot PTM profiles agree substantially with those of established tools. These results suggest a scalable, data-driven alternative to traditional heuristic database search engines in proteomics, with practical implications for less biased peptide identification and PTM analysis.
Abstract
Peptide identification in mass spectrometry-based proteomics is crucial for understanding protein function and dynamics. Traditional database search methods, though widely used, rely on heuristic scoring functions and need additional statistical estimation to reach higher identification rates. Here, we introduce DeepSearch, the first deep learning-based end-to-end database search method for tandem mass spectrometry. DeepSearch uses a modified transformer-based encoder-decoder architecture trained under a contrastive learning framework. Unlike conventional methods that rely on ion-to-ion matching, DeepSearch takes a data-driven approach to scoring peptide-spectrum matches. DeepSearch is also the first deep learning-based method that can profile variable post-translational modifications in a zero-shot manner. We show that DeepSearch's scoring scheme exhibits less bias and does not require any statistical estimation. We validate DeepSearch's accuracy and robustness across various datasets, including datasets from species with diverse protein compositions and a modification-enriched dataset. DeepSearch sheds new light on database search methods in tandem mass spectrometry.
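As an illustration of the scoring idea only, the sketch below ranks candidate peptides for one spectrum by the cosine similarity of their embeddings. It is not DeepSearch's implementation: the function names `cosine_similarity` and `rank_candidates`, the toy 3-dimensional vectors, and the peptide strings are all hypothetical, and real embeddings would come from the trained spectrum and peptide encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_candidates(spectrum_embedding, peptide_embeddings):
    """Return (peptide, score) pairs sorted by similarity, best first.

    peptide_embeddings maps a peptide sequence to its embedding vector.
    """
    scored = [
        (peptide, cosine_similarity(spectrum_embedding, embedding))
        for peptide, embedding in peptide_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values, chosen only to make the ranking obvious).
spectrum = [1.0, 0.0, 1.0]
candidates = {
    "PEPTIDEK": [1.0, 0.0, 1.0],  # same direction as the spectrum
    "SAMPLER": [0.0, 1.0, 0.0],   # orthogonal to the spectrum
    "ELVISK": [1.0, 1.0, 0.0],    # partially aligned
}
ranking = rank_candidates(spectrum, candidates)
for peptide, score in ranking:
    print(f"{peptide}\t{score:.3f}")
```

In this toy setting the top-ranked candidate is the one whose embedding points in the same direction as the spectrum embedding. Because cosine similarity ignores vector magnitude, the score depends only on direction in the shared embedding space, which is the property a contrastive training objective is typically designed to shape.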
