Block Length Gain for Nanopore Channels

Yu-Ting Lin, Hsin-Po Wang, Venkatesan Guruswami

TL;DR

The paper tackles DNA data storage via nanopore channels, focusing on deletion errors that accompany synthesis/sequencing. It extends Geno-Weaving, previously shown to achieve capacity for substitutions, to handle deletions by coding across the large pool of strands (orthogonal to strand position) using polar codes, thereby mitigating the finite-block penalty inherent to short strands. The authors analyze probabilistic and combinatorial deletion models, compare to a traditional concatenation of inner deletion codes with outer Reed–Solomon codes, and demonstrate through simulations that Geno-Weaving attains higher rates and lower pool error across binary and quaternary alphabets, including insertion and mixed-error scenarios. The work shows that per-position polar coding over a vast strand ensemble yields a block-length gain that scales with the number of strands, enabling practical, high-reliability DNA data storage with current technology trajectories. $C(\text{BSC}(\boldsymbol{\nabla})) = 1 - h_2(\boldsymbol{\nabla})$ and $C(\text{QSC}(\boldsymbol{\nabla})) = 1 - h_4(\boldsymbol{\nabla})$ illustrate the capacity framework, while the orthogonal Geno-Weaving design circumvents explicit deletion-channel coding bottlenecks.
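
The two capacity expressions quoted in the TL;DR can be evaluated numerically. The following is a minimal sketch, assuming the standard definitions of the binary and quaternary symmetric channels with capacity measured in $q$-ary symbols per channel use. The function names and the parameter name `p` (the probability that a symbol is received incorrectly) are illustrative choices, not notation from the paper.

```python
import math


def q_ary_entropy(p: float, q: int) -> float:
    """q-ary entropy h_q(p), measured in q-ary symbols.

    h_q(p) = -p * log_q(p / (q - 1)) - (1 - p) * log_q(1 - p),
    with the usual convention 0 * log(0) = 0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    total = 0.0
    if p > 0.0:
        total -= p * math.log(p / (q - 1), q)
    if p < 1.0:
        total -= (1.0 - p) * math.log(1.0 - p, q)
    return total


def symmetric_channel_capacity(p: float, q: int) -> float:
    """Capacity 1 - h_q(p) of a q-ary symmetric channel (q=2: BSC, q=4: QSC)."""
    return 1.0 - q_ary_entropy(p, q)


if __name__ == "__main__":
    # Illustrative error probabilities only; these are not values from the paper.
    for p in (0.001, 0.01, 0.1):
        print(p, symmetric_channel_capacity(p, 2), symmetric_channel_capacity(p, 4))
```

With `q=2` the function reduces to the familiar $1 - h_2(p)$, which falls to zero at $p = 1/2$; with `q=4` it falls to zero at $p = 3/4$, where the output is independent of the input.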

Abstract

DNA is an attractive candidate for data storage. Its millennial durability and nanometer scale offer exceptional data density and longevity. Its relevance to medical applications also drives advances in DNA-related biotechnology. To protect our data against errors, a straightforward approach uses one error-correcting code per DNA strand, with a Reed--Solomon code protecting the collection of strands. A downside is that current technology can only synthesize strands 200--300 nucleotides long. At this block length, the inner code rate suffers a significant finite-length penalty, making its effective capacity hard to characterize. Last year, we proposed $\textit{Geno-Weaving}$ in a JSAIT publication. The idea is to protect the same position across multiple strands using one code; this provably achieves capacity against substitution errors. In this paper, we extend the idea to combat deletion errors and show two more advantages of Geno-Weaving: (1) Because the number of strands is 3--4 orders of magnitude larger than the strand length, the finite-length penalty vanishes. (2) At realistic deletion rates $0.1\%$--$10\%$, Geno-Weaving designed for BSCs works well empirically, bypassing the need to tailor the design for deletion channels.
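
The layout described in the abstract, one code protecting the same position across many strands, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the pool is stored as a binary matrix with one row per strand, each column is encoded with Arikan's polar transform in natural order, and the set `info_positions` is a hypothetical placeholder for a properly designed information set. Decoding under deletions is not shown.

```python
import numpy as np


def polar_transform(u):
    """Apply Arikan's polar transform over GF(2), natural order, no bit reversal.

    The input length must be a power of two. The transform is its own inverse.
    """
    x = np.array(u, dtype=np.uint8) % 2
    n = x.size
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            # Butterfly: upper half becomes upper XOR lower, lower half is kept.
            x[start:start + step] ^= x[start + step:start + 2 * step]
        step *= 2
    return x


def weave_encode(message_bits, info_positions, num_strands):
    """Encode so that position j of every strand forms one polar codeword.

    message_bits has shape (len(info_positions), strand_length).
    Returns a (num_strands, strand_length) matrix; row i is strand i.
    """
    message_bits = np.asarray(message_bits, dtype=np.uint8)
    _, strand_length = message_bits.shape
    pool = np.zeros((num_strands, strand_length), dtype=np.uint8)
    for j in range(strand_length):
        u = np.zeros(num_strands, dtype=np.uint8)  # frozen positions stay 0
        u[info_positions] = message_bits[:, j]
        pool[:, j] = polar_transform(u)
    return pool


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    info_positions = [3, 5, 6, 7]  # hypothetical information set for length 8
    message = rng.integers(0, 2, size=(4, 5), dtype=np.uint8)
    pool = weave_encode(message, info_positions, num_strands=8)
    print(pool)  # 8 toy strands of length 5
```

In this toy example the code length equals the number of strands (8) while each strand has only 5 positions. In the setting of the abstract the number of strands exceeds the strand length by 3 to 4 orders of magnitude, which is why the per-position code sees a much longer block than a per-strand code would.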

Paper Structure

This paper contains 16 sections, 6 theorems, 15 equations, 10 figures, 2 tables.

Key Result

Theorem 2

When the capacity is written as a function of the deletion probability $\delta$ and $\delta$ is sufficiently small, where $h_2$ is the base-$2$ binary entropy function.

Figures (10)

  • Figure 1: The photolithographic approach to synthesize arbitrary DNA. From left to right: Prepare multiple sites on a base plate, each protected by a cap. Selectively break the caps for the sites we want to extend; this is done by lighting, heating, or charging. Inject the next letter we want to extend the sites by (A in this case). The A's naturally attach to sites without a cap. Afterwards, break other caps and inject other letters to extend the sites further.
  • Figure 2: The nanopore sequencer. From left to right: Nanopores are installed on a membrane. Applying an external voltage drives ions through the pores, which generates a current we can measure. The pores have protein engines attached to them. These engines move DNA strands through the pores step by step. As the nucleotides block the pathway, fewer ions pass and less current is measured. Since A, C, G, and T differ ever so slightly in size, we can determine the blocking letter by observing the current drop very carefully.
  • Figure 3: The redundancy $r \coloneqq \ell - \log_q |\mathcal{B}|$, normalized by $d \log_q \ell$, as a function of the deletion number $d$. See Theorems \ref{thm:d=1}--\ref{thm:r=5} for details.
  • Figure 4: The Concatenation code design: Each strand is protected by an inner code that is a combinatorial deletion code. All strands together are protected by an outer code that is a Reed--Solomon code.
  • Figure 5: The estimated DNA code rate \ref{concat} if one concatenates binary combinatorial deletion codes with Reed--Solomon codes. The horizontal axis is the deletion probability. The vertical axis is the code rate. Each curve represents a different $d$, the maximum number of deletions the inner code can handle. A lower $d$ gives a higher overall code rate at low deletion probability, but low-$d$ curves decrease to zero faster than high-$d$ curves because the outer codes see more erasures. From left to right: explicit constructions (Theorems \ref{thm:d=1}, \ref{thm:4d-1}, and \ref{thm:r=4}), implicit constructions (Theorem \ref{thm:dream}), and putative constructions that meet the lower bound (Theorem \ref{thm:dream}).
  • ...and 5 more figures
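
The quantities in the captions of Figures 3 and 5 can be explored with a small numerical sketch. Everything below rests on assumptions that are not taken from the paper: the single-deletion inner code is taken to be the classical Varshamov-Tenengolts (VT) code, a strand is treated as an erasure whenever it suffers more than $d$ deletions, the outer code is idealized as recovering all erasures at rate one minus the erasure probability, and the inner redundancy is modeled as `c * d * log_q(length)` with a hypothetical constant `c`. The formula behind Figure 5 is not reproduced on this page, so `concatenated_rate_estimate` is a stand-in model, not that formula.

```python
import math
from itertools import product


def vt_code_size(n: int, a: int = 0) -> int:
    """Count binary words x_1..x_n with sum(i * x_i) congruent to a mod (n + 1)."""
    count = 0
    for word in product((0, 1), repeat=n):
        checksum = sum(i * bit for i, bit in enumerate(word, start=1))
        if checksum % (n + 1) == a:
            count += 1
    return count


def redundancy(length: int, code_size: int, q: int = 2) -> float:
    """Redundancy r = length - log_q(code_size), as defined in the Figure 3 caption."""
    return length - math.log(code_size, q)


def strand_erasure_probability(length: int, delta: float, d: int) -> float:
    """Probability that an i.i.d. deletion channel deletes more than d symbols."""
    at_most_d = sum(
        math.comb(length, k) * delta**k * (1.0 - delta) ** (length - k)
        for k in range(d + 1)
    )
    return 1.0 - at_most_d


def concatenated_rate_estimate(length, delta, d, c=1.0, q=2):
    """Assumed model: (inner rate) * (fraction of strands that are not erased)."""
    r = c * d * math.log(length, q)           # hypothetical redundancy model
    inner_rate = max(0.0, 1.0 - r / length)
    outer_rate = 1.0 - strand_erasure_probability(length, delta, d)
    return inner_rate * outer_rate


if __name__ == "__main__":
    n = 10  # brute force is only feasible for short toy lengths
    size = vt_code_size(n)
    print("VT size:", size, "normalized redundancy:",
          redundancy(n, size) / math.log(n, 2))
    for d in (1, 2, 4):
        print(d, [round(concatenated_rate_estimate(256, p, d), 3)
                  for p in (0.001, 0.01, 0.05)])
```

Under these assumptions the model shows the trade-off described in the Figure 5 caption: a small `d` costs little inner redundancy, but its rate falls off sooner as the deletion probability grows because more strands become erasures.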

Theorems & Definitions (6)

  • Theorem 2
  • Theorem 3
  • Theorem 4: Lev66
  • Theorem 6: SPC22
  • Theorem 7: GuH21
  • Theorem 8: LTX24