On the Reliability of Information Retrieval From MDS Coded Data in DNA Storage

Serge Kas Hanna

TL;DR

The paper tackles reliable data retrieval in DNA storage when data are protected by an outer and an inner MDS code under i.i.d. substitution errors and nonuniform sequencing coverage. It develops a four-component theoretical framework that links post-consensus nucleotide error rates, inner code decoding outcomes, outer code retrieval conditions, and a computable lower bound on end-to-end success probability. The main technical contributions include a recurrence-based method to compute the retrieval probability conditioned on a read profile, a Gaussian CLT-based approximation for large systems, and two practical bounds that facilitate optimization of sequencing and synthesis costs. The results yield insights into optimal redundancy allocation between inner and outer codes, demonstrate how nonuniform read distributions increase the required reads, and show that inner codes can be crucial in low-read regimes. The framework also accommodates extensions to asymmetric substitutions and non-MDS outer codes, providing a versatile tool for guiding design choices in practical DNA storage systems.

Abstract

This work presents a theoretical analysis of the probability of successfully retrieving data encoded with MDS codes (e.g., Reed-Solomon codes) in DNA storage systems. We study this probability under independent and identically distributed (i.i.d.) substitution errors, focusing on a common code design strategy that combines inner and outer MDS codes. Our analysis demonstrates how this probability depends on factors such as the total number of sequencing reads, their distribution across strands, the rates of the inner and outer codes, and the substitution error probabilities. These results provide actionable insights into optimizing DNA storage systems under reliability constraints, including determining the minimum number of sequencing reads needed for reliable data retrieval and identifying the optimal balance between the rates of inner and outer MDS codes.

Paper Structure

This paper contains 26 sections, 7 theorems, 69 equations, 4 figures.

Key Result

Lemma 1

For a given number of reads $r\in \mathbb{N}$, the post-consensus nucleotide error rate is given by and the inner code symbol error rate is $\epsilon'_{(r)} = 1 - \left(1- \epsilon_{(r)}\right)^{\frac{m}{2}}$, where $\omega_{(\boldsymbol{\kappa})} \triangleq \sum_{i=1}^4 \mathds{1}_{\{\kappa_i = \kappa_1\}}$ and
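The symbol error rate expression in Lemma 1 can be evaluated directly once a post-consensus nucleotide error rate is known. The sketch below is only an illustration: the function name `symbol_error_rate` is hypothetical, and the reading that $m$ is the symbol size in bits (so a symbol spans $m/2$ nucleotides at 2 bits per nucleotide, all of which must be correct) is an assumption inferred from the exponent $m/2$, not something stated in this summary.

```python
def symbol_error_rate(eps: float, m: int) -> float:
    """Evaluate 1 - (1 - eps)^(m/2), the inner code symbol error rate of Lemma 1.

    eps: post-consensus nucleotide error rate for a given number of reads.
    m:   symbol size; assumed here to be in bits, i.e. m/2 nucleotides per symbol.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be a probability in [0, 1]")
    if m <= 0:
        raise ValueError("m must be positive")
    # A symbol is error-free only if each of its m/2 nucleotides is error-free.
    return 1.0 - (1.0 - eps) ** (m / 2)


# Example: a 1% nucleotide error rate with m = 8 gives 1 - 0.99**4.
rate = symbol_error_rate(0.01, 8)
```

Because the expression is increasing in both `eps` and `m`, larger symbols are more exposed to residual nucleotide errors at a fixed number of reads.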

Figures (4)

  • Figure 1: Normalized histograms of two probability vectors sampled from a symmetric $N$-dimensional Dirichlet distribution $\text{Dir}_N(\xi)$, with $N=10526$ and $\xi=3,9$.
  • Figure 2: Probabilities of (a) successful retrieval and (b) retrieval error versus total number of reads $R_{\text{all}}$ for the two sampling probability vectors in Fig. 1.
  • Figure 3: Minimum information read depth $R^{\star}_{\text{all}}/K$ versus (a) outer MDS code rate $\rho_{\text{out}}$; and (b) Dirichlet parameter $\xi$. The Dirichlet parameter is fixed to $\xi=3$ in (a), and the outer code rate is fixed to $\rho_{\text{out}}=0.9$ in (b).
  • Figure 4: Plot (a) shows the optimal information density $\Delta^{\star}=2\rho_{\text{in}}^{\star}\rho_{\text{out}}^{\star}$ in bits/NT as a function of the information read depth; and (b) shows the corresponding optimal code rates $\rho_{\text{in}}^{\star}$, $\rho_{\text{out}}^{\star}$, and $\rho^{\star}=\rho_{\text{in}}^{\star}\rho_{\text{out}}^{\star}$. The Dirichlet parameter is fixed to $\xi=3$.
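The sampling probability vectors described in the caption of Figure 1 can be imitated with a short simulation. This is a minimal sketch using only the Python standard library: a symmetric Dirichlet vector is obtained by normalizing independent Gamma($\xi$, 1) draws. The function names are hypothetical, and the multinomial allocation of reads to strands in `sample_read_profile` is an assumption made for illustration, not a detail confirmed by this summary.

```python
import random
import statistics


def sample_symmetric_dirichlet(n, xi, rng):
    """Draw one probability vector from Dir_n(xi) by normalizing Gamma(xi, 1) variates."""
    gammas = [rng.gammavariate(xi, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_read_profile(probs, total_reads, rng):
    """Assumed read model: each read picks a strand independently with probability probs[i]."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=total_reads):
        counts[idx] += 1
    return counts


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; a simple measure of nonuniformity."""
    return statistics.pstdev(values) / statistics.mean(values)
```

For example, calling `sample_symmetric_dirichlet(10526, 3, random.Random(0))` and the same with `xi=9` should give a visibly more skewed vector for the smaller parameter, since the coefficient of variation of a symmetric Dirichlet vector is roughly $1/\sqrt{\xi}$ for large $N$.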

Theorems & Definitions (11)

  • Remark 1: Rate-reliability trade-off in reconstruction
  • Lemma 1
  • Lemma 2
  • Theorem 3
  • Corollary 4
  • Remark 2
  • Corollary 5
  • Lemma 6
  • Remark 3
  • Theorem 7
  • ...and 1 more