Correcting Contextual Deletions in DNA Nanopore Readouts

Yuan-Pon Chen; Olgica Milenkovic; João Ribeiro; Jin Sima

TL;DR

This work addresses context-dependent synchronization errors in nanopore-based DNA data storage by introducing a run-length threshold model for contextual deletions and studying zero-error contextual deletion-correcting codes. It develops two complementary regimes: a logarithmic-threshold regime $k=C\log n$ with constant error count $t$, and a constant-threshold extremal regime, yielding redundancy bounds and efficient constructions (VT-type and GV-type) that outperform naive worst-case deletion schemes in relevant ranges. The authors provide both existential capacity bounds and practical, efficiently encodable/decodable constructions, including VT-like single- and double-deletion codes and a hash-RS framework for arbitrary constant $t$, with detailed analysis of redundancy and running times. These results advance robust synchronization-aware coding for DNA storage readouts, enabling more reliable data recovery from context-dependent nanopore errors. By connecting combinatorial bounds, automata-based encoding, and algebraic error-correction (Reed–Solomon), the work offers scalable, implementable strategies for high-density DNA storage systems using nanopore sequencing.

Abstract

The problem of designing codes for deletion-correction and synchronization has received renewed interest due to applications in DNA-based data storage systems that use nanopore sequencers as readout platforms. In almost all instances, deletions are assumed to be imposed independently of each other and of the sequence context. These assumptions are not valid in practice, since nanopore errors tend to occur within specific contexts. We study contextual nanopore deletion-errors through the example setting of deterministic single deletions following (complete) runlengths of length at least $k$. The model critically depends on the runlength threshold $k$, and we examine two regimes for $k$: a) $k=C\log n$ for a constant $C\in(0,1)$; in this case, we study error-correcting codes that can protect from a constant number $t$ of contextual deletions, and show that the minimum redundancy (ignoring lower-order terms) is between $(1-C)t\log n$ and $2(1-C)t\log n$, meaning that it is a $(1-C)$-fraction of that of arbitrary $t$-deletion-correcting codes. To complement our non-constructive redundancy upper bound, we design efficiently encodable and decodable codes for any constant $t$. In particular, for $t=1$ and $C>1/2$ we construct efficient codes with redundancy that essentially matches our non-constructive upper bound; b) $k$ equal to a constant; in this case we consider the extremal problem where the number of deletions is not bounded and a deletion is imposed after every run of length at least $k$, which we call the extremal contextual deletion channel. This combinatorial setting arises naturally by considering a probabilistic channel that introduces contextual deletions after each run of length at least $k$ with probability $p$ and taking the limit $p\to 1$. We obtain sharp bounds on the maximum achievable rate under the extremal contextual deletion channel for arbitrary constant $k$.
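To make the extremal contextual deletion channel concrete, the following is a minimal Python sketch of one possible reading of the model. It rests on assumptions not spelled out in the abstract: the deleted symbol is taken to be the one immediately following a maximal ("complete") run of length at least $k$, and runs are identified on the channel input rather than on a partially deleted string. The function names `contextual_deletion_positions` and `extremal_channel` are hypothetical and chosen only for illustration; the paper's formal Definitions 1 and 3 may differ in detail.

```python
from itertools import groupby


def contextual_deletion_positions(x: str, k: int) -> list[int]:
    """Return indices of symbols that immediately follow a maximal run of length >= k.

    Assumption (not from the source): runs are measured on the input string x,
    and the symbol deleted is the first symbol after the run.
    """
    positions = []
    i = 0
    for _, group in groupby(x):
        run_len = len(list(group))
        i += run_len  # i is now the index of the first symbol after this run
        if run_len >= k and i < len(x):
            positions.append(i)
    return positions


def extremal_channel(x: str, k: int) -> str:
    """Apply a deletion after every maximal run of length >= k (the p -> 1 limit)."""
    drop = set(contextual_deletion_positions(x, k))
    return "".join(s for j, s in enumerate(x) if j not in drop)


if __name__ == "__main__":
    # Runs of "0001011100": 000 | 1 | 0 | 111 | 00, so with k = 3 the symbols
    # at indices 3 and 8 are deleted.
    print(extremal_channel("0001011100", 3))  # prints 00001110
```

Under this reading, a string whose runs are all shorter than $k$ passes through the channel unchanged, which is the observation that the constant-redundancy regime $k\geq\log n$ relies on.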

Paper Structure (32 sections, 32 theorems, 99 equations, 3 tables)

Introduction
The Model
Our contributions
Logarithmic threshold, constant number of deletions
Extremal contextual deletion channel, constant threshold
Related work
Binary codes correcting worst-case deletions
Channels with context-dependent synchronization errors
Organization
Bounds on the redundancy of contextual deletion-correcting codes for logarithmic threshold and constant number of errors
The case $k\geq \log n$
Redundancy lower bound for threshold $k<\log n$
A Gilbert-Varshamov-type bound for contextual deletion-correcting codes
Efficient single and double contextual deletion-correcting codes via variants of Varshamov-Tenengolts codes
VT-type codes correcting a single contextual deletion
...and 17 more sections

Key Result

Theorem 1

If $k\geq \log n$, then there exists a $(t=n,k)$-contextual deletion-correcting code $\mathcal{C}\subseteq\{0,1\}^n$ with redundancy $O(1)$.
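One way to see why constant redundancy is plausible in this regime is to count the binary strings of length $n$ in which every run is shorter than $k$: no contextual deletion can be triggered in such a string, so the set of all of them is trivially a zero-error code for any number of contextual deletions. The sketch below computes this count exactly by dynamic programming over compositions of $n$ into run lengths. It is an illustration under stated assumptions, not necessarily the argument used in the paper: $\log$ is taken base 2, and the helper names `count_short_run_strings` and `redundancy_bits` are hypothetical.

```python
import math


def count_short_run_strings(n: int, k: int) -> int:
    """Count binary strings of length n in which every run has length < k."""
    if n == 0:
        return 1
    max_run = k - 1
    if max_run < 1:
        return 0
    # comp[m] = number of ways to write m as an ordered sum of parts in [1, max_run];
    # each such composition is a sequence of run lengths.
    comp = [0] * (n + 1)
    comp[0] = 1
    for m in range(1, n + 1):
        comp[m] = sum(comp[m - j] for j in range(1, min(max_run, m) + 1))
    # The first run can be a run of 0s or of 1s; later runs alternate.
    return 2 * comp[n]


def redundancy_bits(n: int, k: int) -> float:
    """Redundancy n - log2(|code|) of the code of all strings with runs shorter than k."""
    return n - math.log2(count_short_run_strings(n, k))


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        k = math.ceil(math.log2(n))
        print(n, k, round(redundancy_bits(n, k), 3))
```

Running the loop for growing $n$ with $k=\lceil\log_2 n\rceil$ shows the redundancy staying bounded by a small constant rather than growing with $n$, consistent with the $O(1)$ claim of Theorem 1.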

Theorems & Definitions (50)

Definition 1: Contextual deletion
Definition 2: Zero-error contextual deletion-correcting code
Definition 3: Contextual deletion channel
Theorem 1: Constant-redundancy codes for $k\geq \log n$
Theorem 2: Redundancy lower bound
Theorem 3: Non-constructive redundancy upper bound
Theorem 4: Efficiently encodable and decodable codes
Theorem 5
Theorem 6: schoeny17codes
Theorem 7: Theorem 2 (redundancy lower bound), restated
...and 40 more
