Error-Correcting Codes for Nanopore Sequencing

Anisha Banerjee; Yonatan Yehezkeally; Antonia Wachter-Zeh; Eitan Yaakobi

TL;DR

The paper studies error-correcting codes for a nanopore sequencing model that combines intersymbol interference with composition-based readouts and substitution noise. It develops a rigorous graph-theoretic framework to bound redundancy for 1-substitution reads, proving a lower bound of $\log_q\log_q n$ symbols and presenting a redundancy-optimal construction for $\ell\ge 3$ that remains effective for reconstruction from two noisy reads. A concrete code $\mathcal C(n,\ell)$ achieving near-optimal redundancy is introduced, leveraging $(\log_q qn)$-RLL constraints and a parity-check on a transformed read, and shown to correct a single composition substitution; its applicability extends to multiple reads. The work lays a foundation for efficient coding in nanopore-like channels and suggests directions for handling additional error models and broader reconstruction tasks in DNA data storage contexts.

Abstract

Nanopore sequencing, superior to other sequencing technologies for DNA storage in multiple aspects, has recently attracted considerable attention. Its high error rates, however, demand thorough research on practical and efficient coding schemes to enable accurate recovery of stored data. To this end, we consider a simplified model of a nanopore sequencer inspired by Mao \emph{et al.}, incorporating intersymbol interference and measurement noise. Essentially, our channel model passes a sliding window of length $\ell$ over a $q$-ary input sequence; at each time step the window outputs the \textit{composition} of the $\ell$ symbols it encloses and then shifts by $\delta$ positions. In this context, the composition of a $q$-ary vector $\mathbf{x}$ specifies the number of occurrences in $\mathbf{x}$ of each symbol in $\lbrace 0,1,\ldots, q-1\rbrace$. The resulting vector of compositions, termed the \emph{read vector}, may also be corrupted by $t$ substitution errors. By employing graph-theoretic techniques, we deduce that for $\delta=1$, at least $\log \log n$ symbols of redundancy are required to correct a single ($t=1$) substitution. Finally, for $\ell \geq 3$, we exploit some inherent characteristics of read vectors to arrive at an error-correcting code that is of optimal redundancy up to a (small) additive constant for this setting. This construction is also found to be optimal for the case of reconstruction from two noisy read vectors.
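The channel described in the abstract can be illustrated with a minimal Python sketch. The function names `composition` and `read_vector` are hypothetical, and the sketch assumes that only full windows (those lying entirely inside the input) are read; the paper's own boundary convention may differ, for example by also including partial windows at the two ends of the sequence.

```python
from collections import Counter


def composition(window, q):
    """Return the composition of `window`: entry i counts occurrences of symbol i."""
    counts = Counter(window)
    return tuple(counts.get(symbol, 0) for symbol in range(q))


def read_vector(x, q, ell, delta=1):
    """Slide a length-`ell` window over `x` in steps of `delta`.

    Assumption (not taken from the paper): only full windows are read.
    """
    if any(not 0 <= symbol < q for symbol in x):
        raise ValueError("every symbol of x must lie in {0, ..., q-1}")
    return [
        composition(x[start:start + ell], q)
        for start in range(0, len(x) - ell + 1, delta)
    ]


# Ternary example (q = 3) with window length 3 and shift 1.
x = [0, 1, 1, 0, 2]
reads = read_vector(x, q=3, ell=3, delta=1)
print(reads)  # [(1, 2, 0), (1, 2, 0), (1, 1, 1)]
```

In this example the first two windows, `[0, 1, 1]` and `[1, 1, 0]`, have the same composition, which shows how a composition discards the order of the symbols inside a window. A substitution error in this model replaces one entry of `reads` with a different composition.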

Paper Structure (13 sections, 24 theorems, 41 equations, 1 figure)

Introduction
Preliminaries
Notations and Terminology
Properties of the Read Vectors
Error Model
Minimum Redundancy of Single-substitution $(\ell,1)$-read codes
Characterization of Confusable Read Vectors
An Upper Bound on the Code Size
Single Substitution Read Codes
Error correction with multiple reads
Conclusion
Proof of Lemma 7, non-binary extension of [8, Lemma 10]
Non-binary extension of [8, Lemma 13]

Key Result

Lemma 1

Take $\ell,\delta$ satisfying $\ell\equiv 0 \pmod{\delta}$, and let $\lbrace C_\alpha : \alpha\in \Sigma_{\ell/\delta}\rbrace$ be any $(\ell/\delta)$ arbitrary length-$(k+1)$ vectors of compositions, belonging to vectors in $\Sigma_q^\delta$, where $k=\lfloor\frac{n+\ell-(\alpha+1)\delta}{\ell}\rfloor$ …

Figures (1)

Figure 1: Simplified model of a nanopore sequencer

Theorems & Definitions (46)

Definition 1
Definition 2
Definition 3
Example 1
Definition 4
Example 2
Lemma 1
Corollary 1
Corollary 2
Example 3
...and 36 more
