Covering All Bases: The Next Inning in DNA Sequencing Efficiency

Hadas Abraham, Ryan Gabrys, Eitan Yaakobi

TL;DR

This work addresses minimizing coverage depth in DNA-based storage when retrieving a subset of files. It models a system of $m$ files, each consisting of $k$ information strands, encoded into $mn$ strands in total and accessed via random sampling. It analyzes three coding schemes, namely Local MDS, Global MDS, and Partial MDS (PMDS), using a Markov-chain framework to derive the expected number of reads, its distribution, and asymptotics for the retrieval metric $T(n,k;m,a)$, where $a$ is the number of requested files. Two lower bounds are established, and the schemes are compared both analytically and via simulations. The comparison reveals a trade-off between mean retrieval time and stability, with Global MDS (and PMDS in certain regimes) offering robustness in multi-file access. The findings indicate which coding approach minimizes read cost and latency under random-access retrieval, including the extension to multiple-file access. Noisy channels and exact characterizations are left as future work.
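As a rough illustration of the sampling model, the sketch below estimates the number of reads needed to recover one file under a simplified reading of the Local MDS scheme. It rests on assumptions not spelled out in this summary: reads are drawn uniformly at random with replacement from all $mn$ strands, and a file is recovered once any $k$ distinct strands out of its own $n$ encoded strands have been read. Under those assumptions the mean follows the coupon-collector sum $\sum_{i=0}^{k-1} \frac{mn}{n-i}$. The function names are hypothetical and are not taken from the paper.

```python
import random
from fractions import Fraction


def expected_reads_local(n, k, m):
    """Coupon-collector mean under the assumed model.

    Reads needed to see k distinct strands among the n strands of one
    file, when each read is uniform over all m*n strands.
    """
    total = m * n
    # After i distinct file strands are seen, a read is useful w.p. (n - i) / total.
    return float(sum(Fraction(total, n - i) for i in range(k)))


def simulate_reads_local(n, k, m, rng):
    """One simulated retrieval; returns the number of reads used."""
    total = m * n
    seen = set()
    reads = 0
    while len(seen) < k:
        reads += 1
        strand = rng.randrange(total)
        if strand < n:  # assumption: strands 0..n-1 belong to the requested file
            seen.add(strand)
    return reads


def monte_carlo(n, k, m, trials=20000, seed=0):
    """Average reads over many simulated retrievals."""
    rng = random.Random(seed)
    return sum(simulate_reads_local(n, k, m, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n, k, m = 4, 2, 3
    print("closed form:", expected_reads_local(n, k, m))
    print("simulated:  ", monte_carlo(n, k, m))
```

Collecting the per-trial read counts instead of averaging them would give an empirical distribution of the kind the paper reports, though the paper's exact schemes and metric may differ from this simplified model.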

Abstract

DNA is emerging as a promising medium for storing the exponentially growing volume of digital data, owing to its density and durability. This study extends recent research by addressing the coverage depth problem in practical scenarios, exploring which pairings of error-correcting codes with DNA storage systems minimize coverage depth. Working in the random access setting, the study provides theoretical analyses and experimental simulations of the expectation and probability distribution of the number of samples needed for file recovery. The paper is organized into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, and it identifies which coding schemes are effective for optimizing DNA storage systems.

Paper Structure (14 sections, 12 theorems, 42 equations, 2 figures, 2 tables)

Introduction
Definitions, Problem Statement, Related Work
Definitions
Problem Statement
Previous Results
Random Access Expectation for a Single File ($a=1$)
The Local MDS Scheme
The Global MDS Scheme
The Partial MDS Scheme
Lower Bounds
Comparisons and Evaluations
Random Access Expectation for Multiple Files
Conclusion And Future Work
Acknowledgment

Key Result

Theorem 1

For any $1\leq k\leq n$ and $m\geq 1$, it holds that

Figures (2)

Figure 1: Distribution of the number of samples needed for file recovery, at confidence levels of 90%, 95%, and 99%, across the three coding schemes, together with the lower bounds from the improved random-access lower-bound theorem and the accompanying lower-bound lemma.
Figure 2: Cumulative distribution function (CDF) of the number of samples needed for file recovery, together with the lower bounds from the improved random-access lower-bound theorem and the accompanying lower-bound lemma.

Theorems & Definitions (29)

Theorem 1
proof
Corollary 1
proof
Theorem 2
proof
Corollary 2
proof
Definition 1
Theorem 3
...and 19 more
