Optimizing Sequencing Coverage Depth in DNA Storage: Insights From DNA Storage Data

Ruiying Cao, Xin Chen

TL;DR

For noisy channels, the theoretical lower bounds of sequencing coverage depth required for successful data decoding with high probability are studied, and several conclusions are derived that can further guide the efficient implementation of DNA storage experiments.

Abstract

DNA storage is being considered as a new archival storage method for its durability and high information density, but it still faces challenges such as high cost and low throughput. Minimizing DNA coverage depth reduces the sequencing sample size needed to decode the digital data, which lowers both cost and system latency. Previous studies have mainly focused on minimizing coverage depth in uniform distribution channels under theoretical assumptions. In contrast, our work uses real DNA storage experimental data to extend this problem to log-normal distribution channels, a channel model derived from our analysis of PCR and sequencing data. In this framework, we investigate both noiseless and noisy channels. We first demonstrate in detail a positive correlation between MDS code rate and the expected minimum sequencing coverage depth. Moreover, we observe that, when the sample size is optimized for complete decoding, the probability of successfully decoding all information in a single sequencing run first decreases and then increases as the code rate rises. We then extend the lower bounds on the DNA coverage depth from uniform to log-normal noisy channels. The findings of this study provide valuable insights for the efficient execution of DNA storage experiments.

Paper Structure

This paper contains 11 sections, 45 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Framework of this paper.
  • Figure 2: Visualization of different cycles of PCR data.
  • Figure 3: Expected coverage depth required for successful decoding under different code rates. The Uniform dist. curve represents the relationship between coverage depth and code rate in a uniform distribution channel. The PCR10, PCR30, and PCR60 curves correspond to the empirical channels derived from Dataset PCR10, Dataset PCR30, and Dataset PCR60, respectively, showing how coverage depth varies with code rate. Three points denote Monte Carlo simulations performed under the respective channel with code rate $R = 0.5$.
  • Figure 4: The inner plot illustrates the probability that a single experiment fails to decode all information when sequencing is performed at the expected sample size. This probability is quantified by the variance of each designed strand under that sequencing sample size, denoted as $f(K) \triangleq e^{-K \mathbb{E}\left[p_i^{(t)}\right]}-e^{-2K \mathbb{E}\left[p_i^{(t)}\right]}-\frac{n}{K}\left(K\mathbb{E}\left[p_i^{(t)}\right]\right)^2e^{-2K \mathbb{E}\left[p_i^{(t)}\right]}$. The outer plot depicts the trend of this probability, i.e., the derivative $f'(K)$, and indicates that the variance has a single peak, the maximum-variance point annotated in the figure.
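
The kind of Monte Carlo check mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation: the channel is noiseless, an MDS code of rate $R = k/n$ decodes as soon as any $k$ distinct strands out of $n$ have been read, and coverage depth is taken to be the number of reads divided by $n$. The function names, the values of $n$ and the number of trials, and the log-normal parameters (mu = 0, sigma = 1) are hypothetical and are not fitted to the PCR10, PCR30, or PCR60 datasets. For the uniform channel, the simulation can be compared with the standard partial coupon-collector expectation $\sum_{i=0}^{k-1} n/(n-i)$.

```python
import itertools
import random


def expected_reads_uniform(n, k):
    """Expected reads until k distinct strands appear, uniform channel.

    Standard partial coupon-collector result: sum_{i=0}^{k-1} n / (n - i).
    """
    return sum(n / (n - i) for i in range(k))


def simulate_reads(weights, k, rng):
    """Draw reads until k distinct strands have been seen; return the count.

    weights[i] is proportional to the probability of sampling strand i.
    """
    n = len(weights)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of strands")
    cum_weights = list(itertools.accumulate(weights))
    strands = range(n)
    seen = set()
    reads = 0
    while True:
        # Sample in small batches to limit per-call overhead.
        for strand in rng.choices(strands, cum_weights=cum_weights, k=64):
            reads += 1
            seen.add(strand)
            if len(seen) >= k:
                return reads


def mean_coverage_depth(weights, k, trials, rng):
    """Average reads needed for decoding, normalised by the strand count n."""
    n = len(weights)
    total = sum(simulate_reads(weights, k, rng) for _ in range(trials))
    return total / (trials * n)


if __name__ == "__main__":
    rng = random.Random(0)
    n, rate = 100, 0.5          # hypothetical pool size, code rate R = 0.5
    k = int(rate * n)

    uniform = [1.0] * n
    # Assumed log-normal strand abundances (illustrative parameters only).
    skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]

    print("uniform, closed form:", expected_reads_uniform(n, k) / n)
    print("uniform, simulated:  ", mean_coverage_depth(uniform, k, 2000, rng))
    print("log-normal, simulated:", mean_coverage_depth(skewed, k, 2000, rng))
```

Varying `rate` between 0 and 1 traces out a depth-versus-rate curve for each assumed channel. Replacing `skewed` with empirical strand frequencies would give the corresponding curve for a measured channel.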