On the Asymptotic Rate of Optimal Codes that Correct Tandem Duplications for Nanopore Sequencing

Wenjun Yu, Zuo Ye, Moshe Schwartz

TL;DR

The paper characterizes the asymptotic rate of optimal codes that correct tandem-duplication errors of length $k$ in the $\ell$-read vector arising from nanopore sequencing. It develops a nucleus-based, derivative-driven framework to construct optimal codes and analyzes both the unbounded-error and constant-error regimes. For an unbounded number of errors, exact rates are obtained in several regimes (e.g., $\ell=1$ and $\ell|k$), with upper bounds in the remaining cases and a Lovász Local Lemma-based lower bound that can approach 1 as $k+\ell$ grows. For a fixed number of errors $t$, a Sidon-set construction yields a redundancy of $t\log_q n+O(1)$ when $\ell|k$, with a matching lower bound in this case; otherwise, only an upper bound on the redundancy is established. The results extend known duplication-correcting codes to the nanopore read model and provide insight into the fundamental limits and constructions of such codes for DNA sequencing applications.

Abstract

We study codes that can correct backtracking errors during nanopore sequencing. In this channel, a sequence of length $n$ over an alphabet of size $q$ is read by a sliding window of length $\ell$, where from each window we obtain only its composition. Backtracking errors cause some windows to repeat, hence manifesting as tandem-duplication errors of length $k$ in the $\ell$-read vector of window compositions. While existing constructions for duplication-correcting codes can be straightforwardly adapted to this model, even resulting in optimal codes, their asymptotic rate is hard to find. In the regime of an unbounded number of duplication errors, we either give the exact asymptotic rate of optimal codes, or bounds on it, depending on the values of $k$, $\ell$ and $q$. In the regime of a constant number of duplication errors, $t$, we find the redundancy of optimal codes to be $t\log_q n+O(1)$ when $\ell|k$, and only upper bounded by this quantity otherwise.
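
To make the channel model concrete, the following minimal Python sketch builds a read vector of window compositions and applies one tandem duplication of length $k$ to it. It is an illustration under stated assumptions, not the paper's formal definition: the function names `read_vector` and `tandem_duplicate` are hypothetical, only full windows of length $\ell$ are used (the paper's treatment of windows at the sequence boundaries may differ), and a composition is represented as the window's symbols in sorted order.

```python
def read_vector(x, ell):
    """Compositions of all full length-ell windows of x.

    Assumption: only full windows are read; boundary handling in the
    paper may differ. A composition (multiset of symbols) is encoded
    as the window's symbols in sorted order, so symbol order is lost.
    """
    return ["".join(sorted(x[i:i + ell])) for i in range(len(x) - ell + 1)]


def tandem_duplicate(z, i, k):
    """Return z with the length-k block z[i:i+k] repeated right after itself."""
    if i < 0 or k <= 0 or i + k > len(z):
        raise ValueError("duplicated block must lie inside z")
    return z[:i + k] + z[i:i + k] + z[i + k:]


z = read_vector("ACGTA", 2)          # windows AC, CG, GT, TA
z_err = tandem_duplicate(z, 1, 2)    # backtracking repeats windows 1 and 2
print(z)      # ['AC', 'CG', 'GT', 'AT']
print(z_err)  # ['AC', 'CG', 'GT', 'CG', 'GT', 'AT']
```

Note that the last window `TA` is recorded as `AT`: only the composition survives, which is why the error-correcting code is analyzed over read vectors rather than over the original sequence.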

Paper Structure

This paper contains 5 sections, 14 theorems, 125 equations, 2 tables.

Key Result

Lemma 8

Let $k$, $\ell$, and $q$ be positive integers. Let $\boldsymbol{z}\in\Psi_{\ell,q}^*$, and assume $\boldsymbol{z}\Longrightarrow^t_k \boldsymbol{z}'$. Then Additionally, $\sigma_k(\Delta_k(\boldsymbol{z}'))-\sigma_k(\Delta_k(\boldsymbol{z}))$ contains only non-negative entries, and

Theorems & Definitions (26)

  • Example 1
  • Example 2
  • Example 3
  • Definition 4
  • Definition 5
  • Example 6
  • Definition 7
  • Lemma 8
  • Lemma 9
  • Corollary 10
  • ...and 16 more