Embracing Errors Is More Efficient Than Avoiding Them Through Constrained Coding for DNA Data Storage

Franziska Weindel, Andreas L. Gimpel, Robert N. Grass, Reinhard Heckel

TL;DR

This paper determines the error regimes in which embracing substitutions is more efficient than constrained coding for DNA data storage, and suggests that constrained coding for substitution errors is inefficient for existing DNA data storage systems.

Abstract

DNA is an attractive medium for digital data storage. When data is stored on DNA, errors occur, which makes error-correcting coding techniques critical for reliable DNA data storage. To reduce the errors, a common technique is to include constraints that avoid homopolymers (consecutive repeated nucleotides) and balance the GC content, as sequences with homopolymers and unbalanced GC content are often associated with higher error rates. However, constrained coding comes at the cost of an increase in redundancy. An alternative is to control errors by randomizing the sequences, embracing errors, and paying for them with additional coding redundancy. In this paper, we determine the error regimes in which embracing substitutions is more efficient than constrained coding for DNA data storage. Our results suggest that constrained coding for substitution errors is inefficient for existing DNA data storage systems. Theoretical analysis indicates that for constrained coding to be efficient, the increase in substitution errors for nucleotides in homopolymers and sequences with unbalanced GC content must be very large. Additionally, empirical results show that the increase in substitution, deletion, and insertion rates for these nucleotides is minimal in existing DNA storage systems.
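The two constraints named in the abstract can be made concrete with a small check. The sketch below is only an illustration: the function names and the thresholds `max_run` and `gc_tolerance` are hypothetical choices, not values taken from the paper.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    longest = current = 0
    prev = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_content(seq: str) -> float:
    """Fraction of nucleotides in the sequence that are G or C."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 3, gc_tolerance: float = 0.1) -> bool:
    """Check a run-length limit and a GC balance window around 0.5.

    max_run and gc_tolerance are illustrative assumptions, not the paper's values.
    """
    balanced = abs(gc_content(seq) - 0.5) <= gc_tolerance
    return max_homopolymer_run(seq) <= max_run and balanced
```

A constrained code only uses sequences for which such a check passes, which is why it has fewer sequences available and therefore needs more redundancy.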

Paper Structure

This paper contains 20 sections, 7 theorems, 31 equations, 13 figures, 1 table, and 1 algorithm.

Key Result

Theorem 1

Achievable code rates for $m$-constrained coding and unconstrained coding: Let $H(p_r)$ be the entropy of a quaternary random variable that retains its state with probability $1-p_r$ and substitutes to one of the other three states with probability $p_r/3$. Define $q(r)$ as the asymptotic probability that a random nucleotide $X_i$ in a sequence $\mathbf{X} \in \{A,C,G,T\}^n$ occurs in a run of le
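The entropy $H(p_r)$ described in the theorem statement follows directly from the stated probabilities. The sketch below computes it under the assumption that entropy is measured in bits (base-2 logarithm); the excerpt does not show which base the paper uses. The second function is the standard capacity of a quaternary symmetric substitution channel, included for orientation only and not as the rate expression of Theorem 1, which is truncated here.

```python
import math


def quaternary_entropy(p_r: float) -> float:
    """Entropy (in bits, an assumption) of a symbol that keeps its state with
    probability 1 - p_r and moves to each of the 3 other states with p_r / 3."""
    if not 0.0 <= p_r <= 1.0:
        raise ValueError("p_r must lie in [0, 1]")
    probs = [1.0 - p_r] + [p_r / 3.0] * 3
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def symmetric_channel_capacity(p_r: float) -> float:
    """Capacity in bits per nucleotide of a quaternary symmetric channel:
    log2(4) - H(p_r)."""
    return 2.0 - quaternary_entropy(p_r)
```

For example, `quaternary_entropy(0.75)` describes a uniformly random output, at which point the capacity drops to zero.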

Figures (13)

  • Figure 1: Constrained coding removes error-prone sequences to reduce the number of errors at the cost of fewer sequences available to store information. Unconstrained coding controls the error rate by modulo-4 addition of the input sequences with a pseudo-random sequence. This reduces the occurrence of error-prone sequences, but may require more coding redundancy to achieve a vanishing probability of decoding error as the sequence length increases.
  • Figure 2: Error regimes in which $m$-constrained and unconstrained coding achieve a larger code rate. The error regimes are color-coded based on the associated achievable code rate difference $R_u-R_c$, where the gray line indicates similar performances.
  • Figure 3: Error regimes in which $\epsilon$-constrained and unconstrained coding achieve a larger Gilbert-Varshamov code rate lower bound. The error regimes are color-coded based on the code rate lower bound difference $R_u^l-R_c^l$, where gray indicates similar performances.
  • Figure 4: Weighted error rates (according to the sequence read distribution) in percent and their standard deviations as a function of run-length $r$.
  • Figure 5: Weighted error rates (according to the sequence read distribution) in percent and their standard deviations as a function of GC content $w$.
  • ...and 8 more figures
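The randomization step described in the Figure 1 caption (modulo-4 addition of the input with a pseudo-random sequence) can be sketched as follows. The nucleotide-to-integer mapping, the seeded `random.Random` generator, and the function names are assumptions made for illustration; the caption does not specify how the pseudo-random sequence is produced.

```python
import random

BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}  # assumed mapping A=0, C=1, G=2, T=3


def pseudo_random_mask(length: int, seed: int) -> list[int]:
    """Reproducible mask of values in {0, 1, 2, 3} (generator choice is an assumption)."""
    rng = random.Random(seed)
    return [rng.randrange(4) for _ in range(length)]


def randomize(seq: str, seed: int) -> str:
    """Add the mask to the sequence, symbol by symbol, modulo 4."""
    mask = pseudo_random_mask(len(seq), seed)
    return "".join(BASES[(INDEX[b] + m) % 4] for b, m in zip(seq, mask))


def derandomize(seq: str, seed: int) -> str:
    """Invert randomize() by subtracting the same mask modulo 4."""
    mask = pseudo_random_mask(len(seq), seed)
    return "".join(BASES[(INDEX[b] - m) % 4] for b, m in zip(seq, mask))
```

Because the same seed regenerates the same mask, `derandomize(randomize(seq, seed), seed)` returns the original sequence. Long homopolymers in the input are likely to be broken up, though randomization makes them less frequent rather than guaranteeing their absence.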

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Definition 4
  • Definition 5
  • Theorem 2
  • Lemma 1
  • Theorem 3
  • proof
  • ...and 6 more