
AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information

Jun Xia, Shaorong Chen, Jingbo Zhou, Tianze Ling, Wenjie Du, Sizhe Liu, Stan Z. Li

TL;DR

AdaNovo is a framework that calculates the conditional mutual information (CMI) between the spectrum and each amino acid/peptide and uses it for adaptive model training. It excels at identifying amino acids with PTMs and is robust to data noise.

Abstract

Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying the amino acid sequences (peptides) responsible for observed spectra, challenges persist in de novo peptide sequencing. First, prior methods struggle to identify amino acids with post-translational modifications (PTMs) because these occur less frequently in training data than canonical amino acids, which in turn lowers peptide-level identification precision. Second, diverse types of noise and missing peaks in mass spectra reduce the reliability of the training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates the conditional mutual information (CMI) between the spectrum and each amino acid/peptide and uses the CMI for adaptive model training. Extensive experiments demonstrate AdaNovo's state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides in the test sets. Moreover, AdaNovo excels at identifying amino acids with PTMs and is robust to data noise. The supplementary materials contain the official code.

Paper Structure (26 sections, 14 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Comparison of de novo sequencing methods by amino acid-level precision. 'G' and 'A' denote glycine and alanine, respectively; both are canonical amino acids. 'M(+15.99)' and 'Q(+.98)' represent oxidation of methionine and deamidation of glutamine, respectively; both are modified amino acids (amino acids with PTMs). The results are for the human dataset, which is part of the 9-species benchmark tran2017novo.
  • Figure 2: The identification workflow of shotgun proteomics wolters2001automated. The spectrum identification task studied in this work is to produce the peptide sequence (e.g., ATASPPRQK) for the observed spectrum. In the spectrum, peaks representing b- and y-ions of the associated peptide are highlighted in color, while grey peaks indicate unexpected fragmentation events or noise. The spectrum annotations are created using ProteomeXchange vizcaino2014proteomexchange.
  • Figure 3: Schematic diagram of AdaNovo framework.
  • Figure 4: Precision-coverage curves for AdaNovo and Casanovo (AA-level: amino acid-level). Peptide-level curves are generated by ranking predicted peptides by their confidence scores. For amino acid-level curves, all amino acids within a given peptide are assigned the same score. At both levels, peptides that meet the precursor m/z filtering criteria are ranked above those that do not; likewise, amino acids within peptides that pass the precursor m/z filter are ranked above those within peptides that do not. The transition between filtered and unfiltered entries is marked by a red star on each curve.
  • Figure 5: The effects of the two hyperparameters $s_1$ and $s_2$ on AdaNovo. Left: peptide-level precision of AdaNovo under different hyperparameter settings. Right: the corresponding amino acid-level precision.
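
The ranking described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal peptide-level illustration, not the authors' evaluation code: the `Prediction` class, its field names, and the `precision_coverage` function are hypothetical, and coverage is assumed here to mean the fraction of ranked predictions retained at each cutoff.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """Hypothetical record for one predicted peptide (names are assumptions)."""
    score: float            # model confidence for the predicted peptide
    passes_mz_filter: bool  # True if the prediction meets the precursor m/z criteria
    correct: bool           # True if the prediction matches the ground-truth peptide


def precision_coverage(
    preds: List[Prediction],
) -> Tuple[List[Tuple[float, float]], int]:
    """Return (curve, n_filtered).

    curve is a list of (coverage, precision) points, one per cutoff.
    n_filtered is the number of predictions that pass the m/z filter, so
    curve[n_filtered - 1] is the last point before the transition to
    unfiltered entries (the red star in the caption).
    """
    # Predictions that pass the precursor m/z filter come first;
    # within each group, higher confidence ranks earlier.
    ranked = sorted(preds, key=lambda p: (not p.passes_mz_filter, -p.score))

    total = len(ranked)
    curve: List[Tuple[float, float]] = []
    n_correct = 0
    for k, pred in enumerate(ranked, start=1):
        n_correct += int(pred.correct)
        curve.append((k / total, n_correct / k))

    n_filtered = sum(p.passes_mz_filter for p in ranked)
    return curve, n_filtered


# Toy input: the highest-scoring prediction fails the filter,
# so it is ranked last despite its score.
example = [
    Prediction(score=0.90, passes_mz_filter=True, correct=True),
    Prediction(score=0.95, passes_mz_filter=False, correct=False),
    Prediction(score=0.60, passes_mz_filter=True, correct=False),
    Prediction(score=0.80, passes_mz_filter=True, correct=True),
]
example_curve, example_n_filtered = precision_coverage(example)
```

An amino acid-level version would follow the same pattern after expanding each peptide into its amino acids, each carrying the peptide's score and filter status, as the caption describes.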