
Covering All Bases: The Next Inning in DNA Sequencing Efficiency

Hadas Abraham, Ryan Gabrys, Eitan Yaakobi

TL;DR

This work addresses minimizing coverage depth in DNA-based storage when retrieving a subset of files. The system is modeled as $m$ files, each containing $k$ strands, encoded into $mn$ strands in total and accessed via random sampling. The paper analyzes three coding schemes (Local MDS, Global MDS, and PMDS) using a Markov-chain framework to derive expected reads, distributions, and asymptotics for the retrieval metric $T(n,k;m,a)$. Two lower bounds are established, and the schemes are compared both analytically and via simulations, revealing a trade-off between mean retrieval time and stability, with Global MDS (and PMDS in certain regimes) offering robustness in multi-file access. The findings guide the practical design of DNA storage systems by indicating which coding approach minimizes read costs and latency under realistic random-access retrieval tasks. The paper also covers extensions to multiple-file access and points to future work on noisy channels and exact characterizations.

Abstract

DNA is emerging as a promising medium for storing the exponentially growing volume of digital data, owing to its density and durability. This study extends recent research by addressing the \emph{coverage depth problem} in practical scenarios, exploring which pairings of error-correcting codes with DNA storage systems minimize coverage depth. Conducted within random access settings, the study provides theoretical analyses and experimental simulations to examine the expectation and probability distribution of the number of samples needed for file recovery. Structured into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, the paper offers insights into effective coding schemes for optimizing DNA storage systems.
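
The random-access model summarized above can be illustrated with a small Monte Carlo sketch. It is not the paper's Markov-chain analysis. It rests on simplifying assumptions that are not taken from the source: strands are drawn uniformly with replacement from all $mn$ encoded strands, a Local MDS file is recovered once any $k$ distinct strands of its own $n$ have been read, and Global MDS recovery requires any $mk$ distinct strands out of $mn$. The argument `a` is taken to be the number of requested files, which is an assumption about the notation $T(n,k;m,a)$. PMDS is omitted, and all function names are hypothetical.

```python
import random
from fractions import Fraction


def expected_distinct(total, needed, pool=None):
    """Expected number of uniform draws (with replacement) from `pool` strands
    until `needed` distinct strands of a target set of size `total` are seen.
    This is the standard partial coupon-collector sum."""
    pool = total if pool is None else pool
    return sum(Fraction(pool, total - i) for i in range(needed))


def simulate_local(n, k, m, a, rng):
    """Draws until each of the files 0..a-1 has k distinct strands observed
    (assumed Local MDS recovery rule)."""
    seen = [set() for _ in range(a)]
    done = 0
    draws = 0
    while done < a:
        draws += 1
        file_id, idx = divmod(rng.randrange(m * n), n)
        if file_id < a and len(seen[file_id]) < k and idx not in seen[file_id]:
            seen[file_id].add(idx)
            if len(seen[file_id]) == k:
                done += 1
    return draws


def simulate_global(n, k, m, rng):
    """Draws until m*k distinct strands out of m*n are observed
    (assumed Global MDS recovery rule)."""
    seen = set()
    draws = 0
    while len(seen) < m * k:
        draws += 1
        seen.add(rng.randrange(m * n))
    return draws


# Small illustrative parameters (hypothetical, not from the paper).
rng = random.Random(0)
n, k, m = 4, 2, 3
trials = 5000
local_mean = sum(simulate_local(n, k, m, 1, rng) for _ in range(trials)) / trials
global_mean = sum(simulate_global(n, k, m, rng) for _ in range(trials)) / trials
print("local :", local_mean, "closed form:", float(expected_distinct(n, k, pool=m * n)))
print("global:", global_mean, "closed form:", float(expected_distinct(m * n, m * k)))
```

Under these assumptions, the closed-form helper gives the expected number of reads for a single Local MDS file as $\sum_{i=0}^{k-1} \frac{mn}{n-i}$ and for Global MDS as $\sum_{i=0}^{mk-1} \frac{mn}{mn-i}$; the simulated means should land close to these values.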

Paper Structure

This paper contains 14 sections, 12 theorems, 42 equations, 2 figures, and 2 tables.

Key Result

Theorem 1

For any $1\leq k\leq n$ and $m\geq 1$, it holds that

Figures (2)

  • Figure 1: Illustrates the distribution of the sample sizes needed for file recovery at confidence levels of 90%, 95%, and 99% across the three coding schemes, together with the lower bounds specified in \ref{th: ramdom access: better lower bound} and \ref{lm:lowerboundmnmk}.
  • Figure 2: Illustrates the cumulative distribution function (CDF) of the sample sizes needed for file recovery, together with the lower bounds specified in \ref{th: ramdom access: better lower bound} and \ref{lm:lowerboundmnmk}.

Theorems & Definitions (29)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Theorem 2
  • proof
  • Corollary 2
  • proof
  • Definition 1
  • Theorem 3
  • ...and 19 more