Expected Recovery Time in DNA-based Distributed Storage Systems

Adi Levy, Roni Con, Eitan Yaakobi, Han Mao Kiah

TL;DR

This work develops a framework based on the Coupon Collector's Problem (CCP) to quantify the expected recovery time in DNA-based distributed storage systems (DNA-DSS) under erasure-coding schemes. Focusing on scalar MDS and MDS array codes, it derives precise asymptotics for the recovery time per container, revealing Gumbel-type fluctuations and an explicit dependence on redundancy and system parameters through expressions such as $\mathbb{E}[T_j] = \frac{n}{r}\ln n + \frac{n}{r}\ln \binom{M-1}{r} + \frac{\gamma n}{r} \pm o(n)$. It also extends to regenerating-array codes, providing a general upper bound $\mathbb{E}[T] \le \frac{n}{\alpha^*}\ln n + \frac{\beta^*}{b\alpha^*}n + o(n)$ and illustrating concrete latency gains with an example over $\mathbb{F}_3$. Collectively, the results quantify tradeoffs between redundancy, block structure, and recovery latency, informing code design for robust DNA-based archival storage.

Abstract

We initiate the study of DNA-based distributed storage systems, where information is encoded across multiple DNA data storage containers to achieve robustness against container failures. In this setting, data are distributed over $M$ containers, and the objective is to guarantee that the contents of any failed container can be reliably reconstructed from the surviving ones. Unlike classical distributed storage systems, DNA data storage containers are fundamentally constrained by sequencing technology, since each read operation yields the content of a strand sampled uniformly at random from the container. Within this framework, we consider several erasure-correcting codes and analyze the expected recovery time of the data stored in a failed container. Our results are obtained by analyzing generalized versions of the classical Coupon Collector's Problem, which may be of independent interest.
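
To make the read model concrete, the following minimal Python sketch simulates only the classical Coupon Collector's Problem: a single container holding $n$ distinct strands, where each read returns one strand uniformly at random and reading stops once every strand has been seen. It is an illustrative baseline, not the generalized versions analyzed in the paper; the function names (`harmonic`, `reads_until_all_seen`, `mean_reads`) and the parameter values are assumptions chosen for this example. The simulated mean is compared with the well-known closed form $n H_n$, where $H_n$ is the $n$-th harmonic number.

```python
import random


def harmonic(n):
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def reads_until_all_seen(n, rng):
    """Count uniform random reads until all n strand indices are seen.

    Each read models one sequencing operation that returns a strand
    index drawn uniformly at random from {0, ..., n-1}.
    """
    seen = set()
    reads = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        reads += 1
    return reads


def mean_reads(n, trials, seed=0):
    """Estimate the expected number of reads by averaging over trials."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    total = sum(reads_until_all_seen(n, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    n = 50  # hypothetical number of strands in one container
    print("simulated mean:", mean_reads(n, trials=2000))
    print("n * H_n       :", n * harmonic(n))
```

Since $H_n = \ln n + \gamma + o(1)$, this baseline grows as $n \ln n + \gamma n$, which is the same shape as the leading terms in the TL;DR expressions above.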

Paper Structure

This paper contains 10 sections, 24 theorems, 108 equations, and 1 figure.

Key Result

Theorem 1

Let $A^{(0)} = {\boldsymbol{0}}^{n\times (m+\rho)}, A^{(1)}, A^{(2)}, \ldots$ be a sequence of matrices constructed as follows: for each $t \in \mathbb{N}$, we draw $(v_1, \ldots, v_{m+\rho}) \sim \mathrm{Unif}([n]^{m+\rho})$ and set where ${\boldsymbol{e}}_a$ denotes the column vector with a $1$ in the $a$-th position and $0$ in all other entries. Define $T_{n,m,\rho}$ as the random variable tha

Figures (1)

  • Figure 1: An illustration of a DNA-DSS

Theorems & Definitions (60)

  • Remark 1
  • Definition 1: $(n,M,k,|\Sigma|)$ DNA-based distributed storage system (DNA-DSS)
  • Definition 2: The expected recovery time of a container
  • Definition 3: Coupon Collector's distribution
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Definition 4
  • Theorem 4
  • ...and 50 more