
Correcting Contextual Deletions in DNA Nanopore Readouts

Yuan-Pon Chen, Olgica Milenkovic, João Ribeiro, Jin Sima

TL;DR

This work addresses context-dependent synchronization errors in nanopore-based DNA data storage by introducing a run-length threshold model for contextual deletions and studying zero-error contextual deletion-correcting codes. It develops two complementary regimes: a logarithmic-threshold regime $k=C\log n$ with constant error count $t$, and a constant-threshold extremal regime, yielding redundancy bounds and efficient constructions (VT-type and GV-type) that outperform naive worst-case deletion schemes in relevant ranges. The authors provide both existential capacity bounds and practical, efficiently encodable/decodable constructions, including VT-like single- and double-deletion codes and a hash-RS framework for arbitrary constant $t$, with detailed analysis of redundancy and running times. These results advance robust synchronization-aware coding for DNA storage readouts, enabling more reliable data recovery from context-dependent nanopore errors. By connecting combinatorial bounds, automata-based encoding, and algebraic error-correction (Reed–Solomon), the work offers scalable, implementable strategies for high-density DNA storage systems using nanopore sequencing.

Abstract

The problem of designing codes for deletion-correction and synchronization has received renewed interest due to applications in DNA-based data storage systems that use nanopore sequencers as readout platforms. In almost all instances, deletions are assumed to be imposed independently of each other and of the sequence context. These assumptions are not valid in practice, since nanopore errors tend to occur within specific contexts. We study contextual nanopore deletion-errors through the example setting of deterministic single deletions following (complete) runlengths of length at least $k$. The model critically depends on the runlength threshold $k$, and we examine two regimes for $k$: a) $k=C\log n$ for a constant $C\in(0,1)$; in this case, we study error-correcting codes that can protect against a constant number $t$ of contextual deletions, and show that the minimum redundancy (ignoring lower-order terms) is between $(1-C)t\log n$ and $2(1-C)t\log n$, meaning that it is a $(1-C)$-fraction of that of arbitrary $t$-deletion-correcting codes. To complement our non-constructive redundancy upper bound, we design efficiently encodable and decodable codes for any constant $t$. In particular, for $t=1$ and $C>1/2$ we construct efficient codes with redundancy that essentially matches our non-constructive upper bound; b) $k$ equal to a constant; in this case we consider the extremal problem where the number of deletions is not bounded and a deletion is imposed after every run of length at least $k$, which we call the extremal contextual deletion channel. This combinatorial setting arises naturally by considering a probabilistic channel that introduces contextual deletions after each run of length at least $k$ with probability $p$ and taking the limit $p\to 1$. We obtain sharp bounds on the maximum achievable rate under the extremal contextual deletion channel for arbitrary constant $k$.
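
To make the extremal contextual deletion channel concrete, the following is a minimal Python sketch under one reading of the abstract. It assumes that runs are identified in the original input string, that a "complete" run means a maximal run of identical symbols, and that the deleted symbol is the one immediately following such a run. The paper's Definition 1 may differ in these details, and the function names are hypothetical.

```python
def maximal_runs(x):
    """Yield (start, length) for each maximal run of identical symbols in x."""
    i = 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        yield i, j - i
        i = j


def extremal_contextual_deletions(x, k):
    """Delete the symbol right after every maximal run of length >= k.

    Assumption (not confirmed by the source): runs are computed once on the
    original input, and a run that ends the string triggers no deletion
    because no symbol follows it.
    """
    deleted = set()
    for start, length in maximal_runs(x):
        end = start + length  # index of the symbol following the run
        if length >= k and end < len(x):
            deleted.add(end)
    return "".join(c for i, c in enumerate(x) if i not in deleted)


# Runs "000" and "111" each trigger one deletion for k = 3.
print(extremal_contextual_deletions("0001011100", 3))  # -> 00001110
```

Under these assumptions the output is a deterministic function of the input, which is why the extremal setting is a combinatorial question about how many inputs can be told apart from their outputs.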

Paper Structure

This paper contains 32 sections, 32 theorems, 99 equations, and 3 tables.

Key Result

Theorem 1

If $k\geq \log n$, then there exists a $(t=n,k)$-contextual deletion-correcting code $\mathcal{C}\subseteq\{0,1\}^n$ with redundancy $O(1)$.
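
One way to see why constant redundancy is plausible here (a heuristic sketch, not necessarily the authors' argument): a string with no run of length at least $k$ cannot suffer any contextual deletion, so the set of all such strings is trivially a zero-error code. The Python sketch below counts these strings with a simple recurrence and reports the resulting redundancy $n-\log_2|\mathcal{C}|$. The function names are hypothetical, and the code assumes $k\geq 2$ and a base-2 logarithm.

```python
from math import log2


def count_short_run_strings(n, k):
    """Count binary strings of length n whose runs all have length < k."""
    if n == 0:
        return 1
    max_run = k - 1
    # comps[i] = number of ways to split i positions into consecutive
    # runs, each of length between 1 and max_run.
    comps = [1] + [0] * n
    for i in range(1, n + 1):
        comps[i] = sum(comps[i - j] for j in range(1, min(max_run, i) + 1))
    # Runs alternate between 0 and 1, so choosing the first symbol
    # determines the whole string: two strings per run-length pattern.
    return 2 * comps[n]


def redundancy_bits(n, k):
    """Redundancy n - log2(code size) of the run-length-limited code (k >= 2)."""
    return n - log2(count_short_run_strings(n, k))


# Take k = log2(n) for powers of two and compare the redundancy across n.
for n in (64, 256, 1024):
    k = n.bit_length() - 1  # exact log2(n) when n is a power of two
    print(n, k, round(redundancy_bits(n, k), 3))
```

If the heuristic holds, the printed redundancy should stay bounded as $n$ grows, in line with the $O(1)$ claim of Theorem 1.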

Theorems & Definitions (50)

  • Definition 1: Contextual deletion
  • Definition 2: Zero-error contextual deletion-correcting code
  • Definition 3: Contextual deletion channel
  • Theorem 1: Constant-redundancy codes for $k\geq \log n$
  • Theorem 2: Redundancy lower bound
  • Theorem 3: Non-constructive redundancy upper bound
  • Theorem 4: Efficiently encodable and decodable codes
  • Theorem 5
  • Theorem 6: Result cited from [schoeny17codes]
  • Theorem 7: Theorem 2 (redundancy lower bound), restated
  • ...and 40 more