Regressor-guided Diffusion Model for De Novo Peptide Sequencing with Explicit Mass Control

Shaorong Chen, Jingbo Zhou, Jun Xia

TL;DR

DiffuNovo is a regressor-guided diffusion model for de novo peptide sequencing (DNPS) that provides explicit peptide-level mass control. It is the first DNPS model to use a diffusion model as its core backbone, and it achieves a significant reduction in mass error.

Abstract

The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint: a predicted peptide's mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it only in post-processing, leading to numerous implausible predictions that violate this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor-guided diffusion model for de novo peptide sequencing that provides explicit peptide-level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide-level mass loss guides model optimization, while at inference, regressor-based guidance, applied through gradient-based updates in the latent space, steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state-of-the-art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the controllability of the diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.
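
To make the mass consistency constraint concrete, the following is a minimal sketch of how a predicted sequence can be checked against a measured precursor. It is not taken from the DiffuNovo source code. The residue masses are standard monoisotopic values for unmodified amino acids; the function names and the 50 ppm default tolerance are assumptions chosen for illustration, and modifications (for example, carbamidomethylated cysteine) are ignored.

```python
# Monoisotopic residue masses (Da) for the 20 standard, unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added for the peptide termini (Da)
PROTON = 1.00728   # mass of a proton (Da)


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def precursor_neutral_mass(mz, charge):
    """Neutral mass implied by a measured precursor m/z and charge state."""
    return (mz - PROTON) * charge


def is_mass_consistent(seq, mz, charge, tol_ppm=50.0):
    """True if the peptide mass is within tol_ppm of the measured precursor mass.

    tol_ppm=50.0 is an illustrative default, not a value from the paper.
    """
    observed = precursor_neutral_mass(mz, charge)
    ppm_error = abs(peptide_mass(seq) - observed) / observed * 1e6
    return ppm_error <= tol_ppm


# Example: a doubly charged precursor generated from the peptide "PEPTIDE".
mz = (peptide_mass("PEPTIDE") + 2 * PROTON) / 2
print(is_mass_consistent("PEPTIDE", mz, 2))  # correct sequence
print(is_mass_consistent("PEPTIDG", mz, 2))  # E replaced by G, mass is off by about 72 Da
print(is_mass_consistent("PEPTLDE", mz, 2))  # I replaced by isobaric L
```

The last call illustrates that the constraint is necessary but not sufficient: leucine and isoleucine have identical masses, so a sequence can satisfy the mass check and still differ from the true peptide.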

Paper Structure (20 sections, 14 equations, 5 figures, 4 tables, 2 algorithms)


Figures (5)

  • Figure 1: Schematic of a typical bottom-up proteomics workflow [zhang2013protein]. Proteins are digested into peptides and analyzed by a mass spectrometer to produce mass spectra. DNPS is the process of inferring a peptide's sequence directly from its spectrum. The resulting peptide sequences can then be used for protein sequence assembly.
  • Figure 2: Performance comparison of DiffuNovo with state-of-the-art methods on mass-related metrics. Plausible prediction rate is the percentage of predicted peptides whose mass is consistent with the experimental mass (higher is better). Mass error is the mean absolute error between the predicted peptide mass and the experimental mass (lower is better).
  • Figure 3: The architecture of DiffuNovo. In the Forward Diffusion Process, a ground-truth peptide sequence is converted into an embedding $z_0$ and then corrupted by adding noise over $T$ timesteps to produce a noisy latent variable $z_T$. DiffuNovo is trained to reverse this process. The Spectrum Encoder transforms the mass spectrum into an embedding $\mathbf{x}$. The core of DiffuNovo is the Reverse Diffusion Process, in which the Peptide Decoder denoises the noisy latent $z_t$ into a cleaner latent $z_{t-1}$ conditioned on $\mathbf{x}$. Critically, the Peptide Mass Regressor provides guidance to the Peptide Decoder at each timestep $t$, steering the prediction to adhere to the mass constraint. Finally, the clean latent $\hat{z}_0$ is passed to a prediction head to output the final predicted peptide.
  • Figure 4: Performance comparison on mass-related metrics.
  • Figure 5: Mass error loss curves. The figure shows the mass error loss (MSE) as a function of the normalized diffusion timestep. The experiment was run on 5 random batches (2560 samples); the loss is plotted for each batch (dashed lines) together with the average loss (solid line).
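
The regressor guidance described in the Figure 3 caption can be illustrated with a toy sketch. This is not the DiffuNovo implementation: the linear regressor, the identity denoiser, the step size, and the choice to apply the guidance update after each denoising step are all assumptions made so that the gradient can be written by hand. In a real model the regressor would be a neural network and the gradient would come from automatic differentiation.

```python
def regressor(z, w, b):
    """Hypothetical linear mass regressor: predicted mass = w . z + b."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def guidance_step(z, w, b, target_mass, step_size):
    """One gradient-descent step on 0.5 * (regressor(z) - target_mass)**2 w.r.t. z."""
    err = regressor(z, w, b) - target_mass
    # For a linear regressor, the gradient of 0.5 * err**2 with respect to z is err * w.
    return [zi - step_size * err * wi for zi, wi in zip(z, w)]


def guided_denoise(z, w, b, target_mass, denoise, steps, step_size):
    """Alternate a denoising step with a mass-guidance step, from t = steps down to 1."""
    for t in range(steps, 0, -1):
        z = denoise(z, t)  # stand-in for the Peptide Decoder mapping z_t to z_{t-1}
        z = guidance_step(z, w, b, target_mass, step_size)
    return z


# Toy usage: an identity "denoiser" isolates the effect of the guidance updates.
w, b = [1.0, 2.0], 0.0
z_final = guided_denoise(
    z=[0.0, 0.0], w=w, b=b, target_mass=10.0,
    denoise=lambda z, t: z, steps=10, step_size=0.1,
)
print(abs(regressor(z_final, w, b) - 10.0))  # residual mass error after guidance
```

In this toy setting each guidance step shrinks the mass error by a factor of 1 - step_size * ||w||^2 (here 0.5), so the step size must be small enough to keep that factor between -1 and 1.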