Block Length Gain for Nanopore Channels
Yu-Ting Lin, Hsin-Po Wang, Venkatesan Guruswami
TL;DR
The paper addresses DNA data storage over nanopore channels, focusing on the deletion errors introduced during synthesis and sequencing. It extends Geno-Weaving, previously shown to achieve capacity against substitution errors, to handle deletions: polar codes are applied across the large pool of strands, one code per strand position, rather than along each strand, which mitigates the finite-block-length penalty inherent to short strands. The authors analyze probabilistic and combinatorial deletion models, compare against the traditional concatenation of inner deletion codes with an outer Reed–Solomon code, and show through simulations that Geno-Weaving attains higher rates and lower pool error for binary and quaternary alphabets, including insertion and mixed-error scenarios. Per-position polar coding over a vast strand ensemble yields a block-length gain that scales with the number of strands, making high-reliability DNA data storage practical with current technology. The capacity benchmarks are those of the binary and quaternary symmetric channels, $C(\text{BSC}(\boldsymbol{\nabla})) = 1 - h_2(\boldsymbol{\nabla})$ and $C(\text{QSC}(\boldsymbol{\nabla})) = 1 - h_4(\boldsymbol{\nabla})$, where $\boldsymbol{\nabla}$ is the channel's error probability and $h_q$ is the $q$-ary entropy function. Because the Geno-Weaving codes are designed against these symmetric channels, the scheme avoids the bottleneck of coding explicitly for the deletion channel.
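The two symmetric-channel capacity formulas cited above can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function names `q_ary_entropy` and `symmetric_channel_capacity` are hypothetical, and capacity is expressed in $q$-ary symbols per channel use (bits for $q = 2$, nucleotides for $q = 4$).

```python
import math

def q_ary_entropy(p: float, q: int) -> float:
    """q-ary entropy h_q(p), measured in q-ary symbols (log base q)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return 0.0
    if p == 1.0:
        # Only the p * log_q(q - 1) term survives.
        return math.log(q - 1, q)
    return (p * math.log(q - 1, q)
            - p * math.log(p, q)
            - (1.0 - p) * math.log(1.0 - p, q))

def symmetric_channel_capacity(p: float, q: int) -> float:
    """Capacity 1 - h_q(p) of a q-ary symmetric channel with error probability p."""
    return 1.0 - q_ary_entropy(p, q)

if __name__ == "__main__":
    # Error probabilities spanning the 0.1%--10% range quoted in the abstract.
    for p in (0.001, 0.01, 0.1):
        bsc = symmetric_channel_capacity(p, 2)
        qsc = symmetric_channel_capacity(p, 4)
        print(f"p={p:<6} C(BSC)={bsc:.4f} bits  C(QSC)={qsc:.4f} quaternary symbols")
```

Setting `q=2` gives the BSC formula and `q=4` the QSC formula. Both capacities equal 1 at `p = 0` and drop to 0 at `p = (q - 1) / q`, where the output is independent of the input.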
Abstract
DNA is an attractive candidate for data storage. Its millennial durability and nanometer scale offer exceptional data density and longevity. Its relevance to medical applications also drives advances in DNA-related biotechnology. To protect our data against errors, a straightforward approach uses one error-correcting code per DNA strand, with a Reed--Solomon code protecting the collection of strands. A downside is that current technology can only synthesize strands 200--300 nucleotides long. At this block length, the inner code rate suffers a significant finite-length penalty, making its effective capacity hard to characterize. Last year, we proposed $\textit{Geno-Weaving}$ in a JSAIT publication. The idea is to protect the same position across multiple strands using one code; this provably achieves capacity against substitution errors. In this paper, we extend the idea to combat deletion errors and show two more advantages of Geno-Weaving: (1) Because the number of strands is 3--4 orders of magnitude larger than the strand length, the finite-length penalty vanishes. (2) At realistic deletion rates $0.1\%$--$10\%$, Geno-Weaving designed for BSCs works well empirically, bypassing the need to tailor the design for deletion channels.
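To give some intuition for why a code designed for BSCs can cope with deletions when it runs across strands, the sketch below simulates how deletions inside a strand look from the point of view of a single strand position. It rests on assumptions that are not taken from the paper: bits are deleted independently, the receiver zero-pads each shortened strand back to full length, and sent and received strands are compared position by position. The helper names `delete_and_pad` and `per_position_error_rates` are hypothetical, and no polar code is implemented here.

```python
import random

def delete_and_pad(strand, deletion_prob, rng):
    """Delete each bit independently with probability deletion_prob, then
    zero-pad the survivors to the original length (an assumed receiver)."""
    kept = [bit for bit in strand if rng.random() >= deletion_prob]
    return kept + [0] * (len(strand) - len(kept))

def per_position_error_rates(num_strands, strand_len, deletion_prob, seed=0):
    """Empirical mismatch rate at each strand position, taken over all strands.

    Each position (column) is what one across-strand code would see, so its
    block length is num_strands rather than strand_len.
    """
    rng = random.Random(seed)
    errors = [0] * strand_len
    for _ in range(num_strands):
        sent = [rng.randint(0, 1) for _ in range(strand_len)]
        received = delete_and_pad(sent, deletion_prob, rng)
        for pos, (a, b) in enumerate(zip(sent, received)):
            errors[pos] += (a != b)  # a mismatch counts as 1
    return [count / num_strands for count in errors]

if __name__ == "__main__":
    rates = per_position_error_rates(num_strands=2000, strand_len=50,
                                     deletion_prob=0.01, seed=1)
    for pos in (0, 10, 25, 49):
        print(f"position {pos:2d}: mismatch rate {rates[pos]:.3f}")
```

Under these assumptions, a deletion shifts every later bit of its strand, so each column sees substitution-like noise whose rate grows with the position index. A real decoder would also have to contend with the fact that errors within one strand are correlated, which this toy model does not address.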
