Bounds and Constructions of $\ell$-Read Codes under the Hamming Metric

Yubo Sun, Gennian Ge

TL;DR

The paper studies $\ell$-read codes under the Hamming metric arising from nanopore read models, with a focus on $\ell=2$ and minimum distances $d\in\{3,4,5\}$, and extends to general $\ell$ for $d=3$. It provides a detailed structural characterization of when two sequences have a given $2$-read distance, and develops both lower and upper bounds on redundancy, achieving near-optimal results for several regimes. A key contribution is linking $2$-read $(n,3)_q$-codes to classical single-insertion reconstruction codes and improving redundancy bounds in insertion/deletion reconstruction settings, including tight results for $q=2$. The work also introduces constructive methods to build $\ell$-read codes with small redundancy by incorporating positional information and VT-type constraints, and extends the reconstruction-model framework to $\ell$-read codes, offering practical implications for robust data recovery in DNA storage systems.

Abstract

Nanopore sequencing is a promising technology for DNA sequencing. In this paper, we investigate a specific model of the nanopore sequencer, which takes a $q$-ary sequence of length $n$ as input and outputs a vector of length $n+\ell-1$, referred to as an $\ell$-read vector, where the $i$-th entry is a multi-set composed of the $\ell$ elements located between the $(i-\ell+1)$-th and $i$-th positions of the input sequence. Considering the presence of substitution errors in the output vector, we study $\ell$-read codes under the Hamming metric. An $\ell$-read $(n,d)_q$-code is a set of $q$-ary sequences of length $n$ in which the Hamming distance between $\ell$-read vectors of any two distinct sequences is at least $d$. We first improve the result of Banerjee et al., who studied $\ell$-read $(n,d)_q$-codes with the constraint $\ell\geq 3$ and $d=3$. Then, we investigate the bounds and constructions of $2$-read codes with a minimum distance of $3$, $4$, and $5$, respectively. Our results indicate that when $d \in \{3,4\}$, the optimal redundancy of $2$-read $(n,d)_q$-codes is $o(\log_q n)$, while for $d=5$ it is $\log_q n+o(\log_q n)$. Additionally, we establish an equivalence between $2$-read $(n,3)_q$-codes and classical $q$-ary single-insertion reconstruction codes using two noisy reads. We improve the lower bound on the redundancy of classical $q$-ary single-insertion reconstruction codes as well as the upper bound on the redundancy of classical $q$-ary single-deletion reconstruction codes when using two noisy reads. Finally, we study $\ell$-read codes under the reconstruction model.
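
The read model and the code condition described in the abstract can be illustrated with a small sketch. This is not taken from the paper; the function names `read_vector`, `read_distance`, and `is_read_code` are hypothetical, and the sketch assumes that windows overhanging either end of the sequence simply contain fewer than $\ell$ symbols (which is what makes the output length $n+\ell-1$). Each multi-set is represented as a sorted tuple so that two entries compare equal exactly when they agree as multi-sets.

```python
from itertools import combinations


def read_vector(x, ell):
    """Return the ell-read vector of the sequence x (assumed boundary handling).

    Entry i (1-indexed, i = 1..n+ell-1) is the multi-set of symbols at
    positions max(1, i-ell+1)..min(n, i), stored as a sorted tuple.
    """
    n = len(x)
    entries = []
    for i in range(1, n + ell):
        lo = max(1, i - ell + 1)
        hi = min(n, i)
        # Convert the 1-indexed inclusive window [lo, hi] to a Python slice.
        entries.append(tuple(sorted(x[lo - 1:hi])))
    return entries


def read_distance(x, y, ell):
    """Hamming distance between the ell-read vectors of x and y."""
    rx, ry = read_vector(x, ell), read_vector(y, ell)
    return sum(a != b for a, b in zip(rx, ry))


def is_read_code(code, ell, d):
    """Brute-force check that every pair of distinct codewords has
    ell-read distance at least d."""
    return all(read_distance(x, y, ell) >= d for x, y in combinations(code, 2))
```

For example, with $\ell=2$ the sequences `(0, 1)` and `(1, 0)` share the middle entry $\{0,1\}$ and differ only in the two boundary entries, so under these assumptions they cannot both belong to a $2$-read code with $d=3$, whereas `(0, 0)` and `(1, 1)` differ in all three entries. The pairwise check is quadratic in the code size and is meant only for experimenting with small parameters.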

Paper Structure

This paper contains 23 sections, 41 theorems, 32 equations, and 2 tables.

Key Result

Lemma 1

Assume $\ell \geq 3$ and $\boldsymbol{x}\neq \boldsymbol{y}\in \Sigma_q^n$, $d_H(\mathcal{R}_{\ell}(\boldsymbol{x}),\mathcal{R}_{\ell}(\boldsymbol{y})) \leq 2$ holds if and only if there exist $t+2$ sequences $\boldsymbol{u}, \boldsymbol{w} \in \Sigma_q^{\geq 0}, \boldsymbol{v}_1, \ldots, \boldsymbo

Theorems & Definitions (55)

  • Definition 1
  • Remark 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 1: Theorem 2 of Banerjee-23-arxiv-nanopore
  • Lemma 2: Theorems 5 and 6 of Banerjee-23-arxiv-nanopore
  • Lemma 3: Definition 8 and Proposition 9 of Cai-22-IT-recon-edit
  • Lemma 4: Proposition 10 of Chrisnata-22-IT-reconstr-del or Theorem 5.4 of Sun-23-IT-BDR
  • ...and 45 more