Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang

TL;DR

This work addresses the challenge of missing fragmentation in de novo peptide sequencing from MS/MS data. It introduces LIPNovo, a latent-imputation-before-prediction framework that treats imputation as a set-prediction problem and uses bipartite matching to align latent theoretical peaks with observed spectra. By imputing latent peak representations prior to sequence decoding, LIPNovo achieves state-of-the-art performance across amino-acid, peptide, and PTM-level metrics on three benchmark datasets, significantly outperforming strong baselines. The approach demonstrates that latent-space augmentation can reduce ambiguity between spectra and peptide sequences, offering a practical and scalable paradigm for proteomics analysis.

Abstract

De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
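The optimal bipartite matching mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the squared-distance cost, the toy two-dimensional vectors, and the function names `squared_distance` and `optimal_matching` are all assumptions made for illustration, and the exhaustive search over permutations stands in for whatever assignment solver the released code uses. It only shows the idea of pairing each target representation with a unique predicted one at minimum total cost.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def optimal_matching(predictions, targets):
    """Match every target to a distinct prediction with minimum total cost.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to target j. Exhaustive search: only for tiny sets.
    """
    if len(predictions) < len(targets):
        raise ValueError("need at least as many predictions as targets")
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(predictions)), len(targets)):
        cost = sum(
            squared_distance(predictions[i], targets[j])
            for j, i in enumerate(perm)
        )
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Hypothetical toy data: three query outputs, two theoretical-peak targets.
predictions = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
targets = [[0.0, 0.9], [1.0, 0.0]]
assignment, cost = optimal_matching(predictions, targets)
print(assignment, cost)
```

In this toy case target 0 is paired with prediction 1 and target 1 with prediction 0, leaving prediction 2 unmatched. For realistic set sizes a polynomial-time assignment solver would replace the exhaustive search.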

Paper Structure

This paper contains 22 sections, 9 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison of amino acid-level precision between LIPNovo (ours) and existing methods under varying missing fragmentation ratios. As the missing ratio increases, performance deteriorates dramatically, highlighting the detrimental impact of the missing fragmentation issue. The proposed LIPNovo consistently outperforms existing methods across all missing ratios. Results are based on the test set (i.e., the yeast species) from the Nine-species dataset (Tran et al., 2017).
  • Figure 2: Illustration of the computational paradigm of LIPNovo. During training, LIPNovo generates a theoretical spectrum from the target peptide (Figure 3), which is embedded by the spectrum encoder together with the observed spectrum. LIPNovo then learns to impute the latent representations of the theoretical peaks. Bipartite matching enforces a unique matching between imputed results and ground truths, followed by a tailored imputation training objective $\mathcal{L}_{\text{Imputation}}$. Finally, the high-confidence imputation results are concatenated with the original spectrum representations and fed into the peptide decoder to predict the peptide sequence. During inference, the upper part is discarded, eliminating the need for the theoretical spectrum during testing. "[$]" is the stop token.
  • Figure 3: Illustration of theoretical spectrum calculation. For example, splitting the peptide between 'E' and 'P' yields the b2 ion (PE) and the y5 ion (PTIDE). The masses of these two ions can be calculated using the mass table of amino acid residues. Here, we assume a charge of +1 and set the intensity to 100%.
  • Figure 4: Missing fragmentation ratio vs. model performance. LIPNovo outperforms existing methods under various missing fragmentation ratios on Seven-species and HC-PT datasets.
  • Figure 5: Imputation quality vs. model performance. (Left) A smaller imputation loss corresponds to higher performance. (Right) The upper bound of LIPNovo, obtained by directly using ground truth representations instead of predicted representations for the theoretical spectrum on the test set.
  • ...and 2 more figures
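
The theoretical spectrum calculation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not code from the LIPNovo repository: the peptide PEPTIDE is inferred from the caption's fragments (PE and PTIDE), the function name `theoretical_ions` is hypothetical, and the mass table holds standard monoisotopic residue masses for only the residues needed here. Intensities, which the caption fixes at 100%, are omitted.

```python
# Monoisotopic residue masses in daltons (standard values; subset only).
RESIDUE_MASS = {
    "P": 97.05276,
    "E": 129.04259,
    "T": 101.04768,
    "I": 113.08406,
    "D": 115.02694,
}
WATER = 18.010565   # added to y ions, which keep the C-terminal OH and an extra H
PROTON = 1.007276   # mass of one charge carrier


def theoretical_ions(peptide, charge=1):
    """Return {ion_name: m/z} for the b and y ions of `peptide`.

    Each of the len(peptide) - 1 backbone positions is split once:
    the prefix gives a b ion and the suffix gives a y ion.
    """
    ions = {}
    for i in range(1, len(peptide)):
        prefix, suffix = peptide[:i], peptide[i:]
        b_mass = sum(RESIDUE_MASS[aa] for aa in prefix)
        y_mass = sum(RESIDUE_MASS[aa] for aa in suffix) + WATER
        ions[f"b{len(prefix)}"] = (b_mass + charge * PROTON) / charge
        ions[f"y{len(suffix)}"] = (y_mass + charge * PROTON) / charge
    return ions


ions = theoretical_ions("PEPTIDE")
print(round(ions["b2"], 4), round(ions["y5"], 4))
```

With a charge of +1, the split between 'E' and 'P' gives m/z values of about 227.10 for b2 and 574.27 for y5. A seven-residue peptide yields six b ions and six y ions in total.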