Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction

Yilong Lu, Si Chen, Songyan Gao, Han Liu, Xin Dong, Wenfeng Shen, Guangtai Ding

TL;DR

The paper tackles the bottleneck of sequencing mirror-image peptides for data storage by proposing an indirect optimization of peptide sequences through predicting peptide bond cleavage. It introduces MiPD513 as a mirror-image peptide MS/MS dataset, PBCLA to label bond cleavage events, and DBond, a deep learning model that fuses sequence, precursor, and MS environment features to predict cleavage. A comparison between multi-label and single-label strategies shows that decomposing the task into single-bond predictions yields stronger sequencing-ease guidance, with DBond-s achieving higher per-bond accuracy and F1 than DBond-m and existing baselines. The work provides a practical pathway to select mapping rules between raw data and D-amino acids, enabling optimized sequence design for robust, high-density, long-lived biological data storage. The approach combines experimental data, automated labeling, and deep learning to address a core challenge in de-novo sequencing of mirror-image peptides and underscores the potential impact on scalable bio-based data storage technologies.

Abstract

Traditional non-biological storage media, such as hard drives, are limited in both storage density and lifespan, a problem made more pressing by the rapid growth of data in the big data era. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium due to their high storage density, structural stability, and long lifespan. The sequencing of mirror-image peptides relies on de-novo technology. However, its accuracy is limited by the scarcity of tandem mass spectrometry datasets and by the difficulty current algorithms have in processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. In this work, we introduce DBond, a deep neural network-based model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors to predict mirror-image peptide bond cleavage. Sequences with a high predicted peptide bond cleavage ratio, which are easier to sequence, are then selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled samples from MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.
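The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the cleavage ratio of a sequence is the fraction of its peptide bonds predicted to cleave, and that a mapping rule assigns 2-bit symbols to residues; both are illustrative assumptions, not details confirmed by the abstract. The names `toy_predictor`, `encode`, and `best_rule` are hypothetical, and `toy_predictor` is an arbitrary stand-in, not DBond.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained cleavage predictor (NOT DBond).
# It returns one 0/1 label per peptide bond, i.e. len(sequence) - 1 labels.
def toy_predictor(sequence: str) -> List[int]:
    # Arbitrary toy rule for illustration only: a bond counts as cleaved
    # when either flanking residue is in this made-up set.
    favorable = {"K", "R"}
    return [int(a in favorable or b in favorable)
            for a, b in zip(sequence, sequence[1:])]

def cleavage_ratio(sequence: str, predict: Callable[[str], List[int]]) -> float:
    """Fraction of peptide bonds predicted to cleave (assumed definition)."""
    n_bonds = len(sequence) - 1
    if n_bonds <= 0:
        return 0.0
    labels = predict(sequence)
    if len(labels) != n_bonds:
        raise ValueError("predictor must return one label per peptide bond")
    return sum(labels) / n_bonds

def encode(bits: str, rule: Dict[str, str]) -> str:
    """Map a bit string to residues, two bits per residue (assumed scheme)."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(rule[bits[i:i + 2]] for i in range(0, len(bits), 2))

def best_rule(payloads: List[str],
              rules: Dict[str, Dict[str, str]],
              predict: Callable[[str], List[int]]) -> str:
    """Return the name of the rule with the highest mean cleavage ratio."""
    def mean_ratio(name: str) -> float:
        ratios = [cleavage_ratio(encode(p, rules[name]), predict)
                  for p in payloads]
        return sum(ratios) / len(ratios)
    return max(rules, key=mean_ratio)

# Two hypothetical candidate mapping rules over the same four residues.
rules = {
    "rule_1": {"00": "G", "01": "A", "10": "K", "11": "R"},
    "rule_2": {"00": "K", "01": "R", "10": "G", "11": "A"},
}
payloads = ["0000", "0001"]
chosen = best_rule(payloads, rules, toy_predictor)
```

With a real predictor in place of `toy_predictor`, the same loop would rank candidate mapping rules by how easy the resulting sequences are expected to be to sequence.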

Paper Structure

This paper contains 17 sections, 7 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of data storage technology based on mirror-image peptides. (a) The peptide bond cleavage ratio predicted by DBond can be used to identify sequences that are easier to sequence, thereby finding the optimal mapping rules and optimizing sequence design. (b) Data storage technology based on mirror-image peptide sequences can be divided into two stages, data storage and data recovery, which are further divided into six steps. (c) During the sequencing of mirror-image peptides, de-novo methods are required to accurately identify the corresponding D-amino acid sequence for each specific mirror-image peptide.
  • Figure 2: Statistical information of MiPD513. (a) The x-axis represents the types of mirror-image peptides, while the y-axis indicates the number of tandem mass spectra. (b) The x-axis represents the sequence lengths of mirror-image peptides, while the y-axis indicates the number of mirror-image peptides. (c) The x-axis represents the types of D-amino acids, while the y-axis indicates the number of mirror-image peptides.
  • Figure 3: The overall architecture of DBond. By adjusting the output dimensions of the MLP layer, it can be applied to both single-label classification tasks and multi-label classification tasks.
  • Figure 4: Labelling results of the PBCLA on MiPD513. (a) The x-axis represents the types of mirror-image peptides, the left y-axis indicates the corresponding sample count, and the right y-axis shows the corresponding positive sample ratio (the same applies below). (b) The x-axis represents the position of the peptide bond. (c) The x-axis represents the charge state of the precursor. (d) The x-axis represents the normalized collision energy. (e) The x-axis represents the scan number during the tandem mass spectrometry process.
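The difference between the multi-label and single-label strategies mentioned in the Figure 3 caption can be sketched in terms of target shape. The snippet below is a minimal illustration, assuming a multi-label sample carries one 0/1 label per peptide bond and that the single-label form pairs each bond position with its own label. The function name `to_single_label` and the tuple layout are hypothetical, and the real model inputs also include precursor and MS environment features that are omitted here.

```python
from typing import List, Tuple

def to_single_label(sequence: str,
                    bond_labels: List[int]) -> List[Tuple[str, int, int]]:
    """Split one multi-label sample into one sample per peptide bond.

    Each output tuple is (sequence, bond_position, label), with bond
    positions counted from 1 at the N-terminal side (assumed convention).
    """
    if len(bond_labels) != len(sequence) - 1:
        raise ValueError("expected one label per peptide bond")
    return [(sequence, pos, label)
            for pos, label in enumerate(bond_labels, start=1)]

# One multi-label sample: a 4-residue peptide has 3 bonds, hence 3 labels.
multi_label_target = [1, 0, 1]
single_label_samples = to_single_label("GAKR", multi_label_target)
```

Under this framing, a multi-label head emits a vector with one entry per bond, while a single-label head emits one value for the bond position supplied as an extra input.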