Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry
Shiva Ebrahimi, Xuan Guo
TL;DR
De novo peptide sequencing from data-independent acquisition (DIA) mass spectrometry is difficult because each spectrum multiplexes fragment ions from several precursors. The authors present a transformer-based model, DiaTrans, that pairs a spectrum encoder fusing MS1 signal, precursor information, and MS2 fragments with a transformer decoder that predicts peptide sequences, using beam search at inference. Across three Homo sapiens DIA datasets (UTI, OC, Plasma), DiaTrans outperforms the state-of-the-art DIA-specific methods DeepNovo-DIA and PepNet, particularly at the peptide level, while maintaining practical runtime. The work provides a scalable, open-source tool (Casanovo-DIA lineage) to improve peptide discovery and profiling in DIA proteomics.
Abstract
Tandem mass spectrometry (MS/MS) is the predominant high-throughput technique for comprehensively analyzing the protein content of biological samples and is a cornerstone of proteomics. In recent years, Data-Independent Acquisition (DIA) strategies have advanced substantially, enabling unbiased, untargeted fragmentation of precursor ions. DIA-generated MS/MS spectra are nevertheless difficult to interpret because they are highly multiplexed: each spectrum contains fragment ions from multiple precursor peptides. This is a particularly acute challenge for de novo peptide/protein sequencing, where current methods are ill-equipped to handle multiplexing. In this paper, we introduce DiaTrans (also referred to as Casanovo-DIA), a deep-learning model based on the transformer architecture that deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing state-of-the-art (SOTA) methods, including DeepNovo-DIA and PepNet. At the amino acid level, DiaTrans improves precision by 15.14% to 34.8% and recall by 11.62% to 31.94%; at the peptide level, it improves precision by 59% to 81.36%. Combining DIA data with the DiaTrans model holds considerable promise for uncovering novel peptides and for more comprehensive profiling of biological samples. DiaTrans is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.
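To make the distinction between amino-acid-level and peptide-level metrics concrete, the following is a minimal sketch of how such scores could be computed. It is not the authors' evaluation code: the function names `count_matches` and `evaluate` are hypothetical, and the matching rule is an assumption (plain position-by-position string equality), whereas the exact rule used in the paper is not stated in this section.

```python
from typing import Dict, List, Optional


def count_matches(pred: str, truth: str) -> int:
    """Count positions where predicted and true residues agree.

    Assumption: simple positional equality; a real evaluation may
    instead compare residue masses within a tolerance.
    """
    return sum(p == t for p, t in zip(pred, truth))


def evaluate(preds: List[Optional[str]], truths: List[str]) -> Dict[str, float]:
    """Compute amino-acid precision/recall and peptide precision.

    preds[i] is None when no sequence was predicted for spectrum i.
    """
    aa_matched = aa_predicted = aa_truth = 0
    pep_correct = pep_predicted = 0
    for pred, truth in zip(preds, truths):
        aa_truth += len(truth)  # recall denominator counts every true residue
        if pred is None:
            continue
        pep_predicted += 1
        aa_predicted += len(pred)
        matched = count_matches(pred, truth)
        aa_matched += matched
        # A peptide is correct only if every residue matches and lengths agree.
        if len(pred) == len(truth) and matched == len(truth):
            pep_correct += 1
    return {
        "aa_precision": aa_matched / aa_predicted if aa_predicted else 0.0,
        "aa_recall": aa_matched / aa_truth if aa_truth else 0.0,
        "peptide_precision": pep_correct / pep_predicted if pep_predicted else 0.0,
    }


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    scores = evaluate(["PEPTIDE", "PEPTLDE", None], ["PEPTIDE", "PEPTIDE", "ACDK"])
    print(scores)
```

Under this simplified rule, a single wrong residue lowers amino-acid precision only slightly but makes the whole peptide count as incorrect, which is why peptide-level scores are the stricter of the two.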
