Optimizing Mirror-Image Peptide Sequence Design for Data Storage via Peptide Bond Cleavage Prediction
Yilong Lu, Si Chen, Songyan Gao, Han Liu, Xin Dong, Wenfeng Shen, Guangtai Ding
TL;DR
The paper tackles the bottleneck of sequencing mirror-image peptides for data storage by proposing an indirect optimization of peptide sequences through predicting peptide bond cleavage. It introduces MiPD513 as a mirror-image peptide MS/MS dataset, PBCLA to label bond cleavage events, and DBond, a deep learning model that fuses sequence, precursor, and MS environment features to predict cleavage. A comparison between multi-label and single-label strategies shows that decomposing the task into single-bond predictions yields stronger sequencing-ease guidance, with DBond-s achieving higher per-bond accuracy and F1 than DBond-m and existing baselines. The work provides a practical pathway to select mapping rules between raw data and D-amino acids, enabling optimized sequence design for robust, high-density, long-lived biological data storage. The approach combines experimental data, automated labeling, and deep learning to address a core challenge in de-novo sequencing of mirror-image peptides and underscores the potential impact on scalable bio-based data storage technologies.
Abstract
Traditional non-biological storage media, such as hard drives, are limited in both storage density and lifespan, and the rapid growth of data in the big data era makes these limits increasingly pressing. Mirror-image peptides composed of D-amino acids have emerged as a promising biological storage medium due to their high storage density, structural stability, and long lifespan. The sequencing of mirror-image peptides relies on de-novo technology. However, the accuracy of de-novo sequencing is limited by the scarcity of tandem mass spectrometry datasets and by the challenges that current algorithms encounter when processing these peptides directly. This study is the first to propose improving sequencing accuracy indirectly by optimizing the design of mirror-image peptide sequences. In this work, we introduce DBond, a deep neural network-based model that integrates sequence features, precursor ion properties, and mass spectrometry environmental factors to predict mirror-image peptide bond cleavage. Sequences with a high predicted peptide bond cleavage ratio, which are easier to sequence, are then selected. The main contributions of this study are as follows. First, we constructed MiPD513, a tandem mass spectrometry dataset containing 513 mirror-image peptides. Second, we developed the peptide bond cleavage labeling algorithm (PBCLA), which generated approximately 12.5 million labeled samples from MiPD513. Third, we proposed a dual prediction strategy that combines multi-label and single-label classification. On an independent test set, the single-label classification strategy outperformed other methods in both single and multiple peptide bond cleavage prediction tasks, offering a strong foundation for sequence optimization.
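The selection step described above (prefer sequences whose bonds are mostly predicted to cleave) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the paper: the 0.5 decision threshold, the 2-bit mapping rules, the one-letter residue symbols, and in particular `toy_predictor`, which is an arbitrary placeholder standing in for a trained per-bond model such as DBond and carries no chemical meaning.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def cleavage_ratio(bond_probs: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of peptide bonds whose predicted cleavage probability
    reaches the threshold (the threshold value is an assumption)."""
    if not bond_probs:
        return 0.0
    cleaved = sum(1 for p in bond_probs if p >= threshold)
    return cleaved / len(bond_probs)


def encode(bits: str, mapping: Dict[str, str]) -> str:
    """Translate a bit string into a residue string using a fixed-width
    mapping rule (hypothetical rule format: bit group -> residue symbol)."""
    width = len(next(iter(mapping)))
    if len(bits) % width != 0:
        raise ValueError("bit string length must be a multiple of the group width")
    return "".join(mapping[bits[i:i + width]] for i in range(0, len(bits), width))


def rank_mappings(
    bits: str,
    mappings: Dict[str, Dict[str, str]],
    predictor: Callable[[str], List[float]],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Encode the same data under each candidate rule and sort the rules
    by predicted cleavage ratio, highest first."""
    scored = []
    for name, mapping in mappings.items():
        peptide = encode(bits, mapping)
        probs = predictor(peptide)  # one probability per bond: len(peptide) - 1
        scored.append((name, peptide, cleavage_ratio(probs, threshold)))
    return sorted(scored, key=lambda item: item[2], reverse=True)


def toy_predictor(peptide: str) -> List[float]:
    """Placeholder for a trained model: returns an arbitrary probability per
    bond based only on the left-hand residue. Not a chemical claim."""
    return [0.9 if peptide[i] in "KL" else 0.2 for i in range(len(peptide) - 1)]


# Two hypothetical mapping rules from 2-bit groups to residue symbols.
candidate_rules = {
    "rule_a": {"00": "A", "01": "G", "10": "L", "11": "K"},
    "rule_b": {"00": "K", "01": "L", "10": "G", "11": "A"},
}

if __name__ == "__main__":
    for name, peptide, ratio in rank_mappings("00010001", candidate_rules, toy_predictor):
        print(name, peptide, ratio)
```

In practice the placeholder would be replaced by a model that also takes precursor ion and instrument settings as input, since the abstract states that DBond uses those features. The ranking logic itself stays the same: the mapping rule whose encoded peptides have the highest predicted cleavage ratio is the preferred one.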
