Covering All Bases: The Next Inning in DNA Sequencing Efficiency
Hadas Abraham, Ryan Gabrys, Eitan Yaakobi
TL;DR
This work studies how to minimize coverage depth in DNA-based storage when only a subset of the stored files is retrieved. The system holds $m$ files of $k$ strands each; the resulting $mk$ information strands are encoded into $mn$ strands, which are then read by random sampling. Three coding schemes (Local MDS, Global MDS, and PMDS) are analyzed with a Markov-chain framework that yields the expected number of reads, its distribution, and the asymptotic behavior of the retrieval metric $T(n,k;m,a)$. Two lower bounds are established, and the schemes are compared both analytically and through simulations. The comparison reveals a trade-off between mean retrieval time and stability, with Global MDS (and PMDS in certain regimes) offering robustness in multi-file access. The findings indicate which coding approach minimizes read cost and latency for realistic random-access retrieval tasks, including multiple-file access; noisy channels and exact characterizations are left as future work.
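The random-sampling model above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' Markov-chain analysis: reads are assumed uniform over the $mn$ encoded strands with replacement and noiseless, a file under Local MDS is assumed recoverable once any $k$ distinct strands of its own $n$ have been read, and $a$ is taken to be the number of requested files. The function names are hypothetical. The closed form is the standard coupon-collector expectation for seeing a given number of distinct items.

```python
import random
from fractions import Fraction


def expected_reads_distinct(total, needed):
    """Exact expected number of uniform reads (with replacement) until
    `needed` distinct strands out of `total` have been seen."""
    # After i distinct strands are seen, a read is new with probability
    # (total - i) / total, so the wait for the next one is total / (total - i).
    return sum(Fraction(total, total - i) for i in range(needed))


def simulate_local_mds(n, k, m, a, trials=2000, seed=0):
    """Monte Carlo estimate of the reads needed until each of the first `a`
    files has k distinct strands (assumed Local MDS recovery rule)."""
    rng = random.Random(seed)
    total_reads = 0
    for _ in range(trials):
        seen = [set() for _ in range(a)]  # distinct strands per requested file
        done = 0                          # requested files already recoverable
        reads = 0
        while done < a:
            reads += 1
            # Strand s belongs to file s // n and has index s % n in that file.
            file_id, idx = divmod(rng.randrange(m * n), n)
            if file_id < a and len(seen[file_id]) < k and idx not in seen[file_id]:
                seen[file_id].add(idx)
                if len(seen[file_id]) == k:
                    done += 1
        total_reads += reads
    return total_reads / trials


if __name__ == "__main__":
    n, k, m, a = 4, 2, 2, 1
    print("single file, exact:", float(expected_reads_distinct(n, k)))
    print("Local MDS, simulated:", simulate_local_mds(n, k, m, a))
    # Assumption: decoding everything from any m*k distinct strands.
    print("decode all, exact:", float(expected_reads_distinct(m * n, m * k)))
```

With a single stored file ($m = a = 1$), the simulation should agree with the closed form up to sampling error. With $m > a$, reads that land on unrequested files are wasted, which is the effect that separates the schemes being compared.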
Abstract
DNA has emerged as a promising medium for storing the exponentially growing volume of digital data, owing to its density and durability. This study extends recent research by addressing the \emph{coverage depth problem} in practical scenarios, exploring how best to pair error-correcting codes with DNA storage systems in order to minimize coverage depth. Working in the random-access setting, the study provides theoretical analyses and experimental simulations of the expectation and probability distribution of the number of samples needed for file recovery. The paper is organized into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, and it offers insights into which coding schemes are effective for optimizing DNA storage systems.
