Bounds and Constructions of $\ell$-Read Codes under the Hamming Metric
Yubo Sun, Gennian Ge
TL;DR
The paper studies $\ell$-read codes under the Hamming metric arising from nanopore read models, with a focus on $\ell=2$ and minimum distances $d\in\{3,4,5\}$, and extends to general $\ell$ for $d=3$. It provides a detailed structural characterization of when two sequences have a given $2$-read distance, and develops both lower and upper bounds on redundancy, showing that the optimal redundancy of $2$-read $(n,d)_q$-codes is $o(\log_q n)$ for $d\in\{3,4\}$ and $\log_q n+o(\log_q n)$ for $d=5$. A key contribution is linking $2$-read $(n,3)_q$-codes to classical single-insertion reconstruction codes and improving redundancy bounds in insertion/deletion reconstruction settings, including tight results for $q=2$. The work also introduces constructive methods to build $\ell$-read codes with small redundancy by incorporating positional information and VT-type constraints, and extends the reconstruction-model framework to $\ell$-read codes, with implications for robust data recovery in DNA storage systems.
Abstract
Nanopore sequencing is a promising technology for DNA sequencing. In this paper, we investigate a specific model of the nanopore sequencer, which takes a $q$-ary sequence of length $n$ as input and outputs a vector of length $n+\ell-1$, referred to as an $\ell$-read vector, whose $i$-th entry is a multi-set composed of the $\ell$ elements located between the $(i-\ell+1)$-th and $i$-th positions of the input sequence. Considering the presence of substitution errors in the output vector, we study $\ell$-read codes under the Hamming metric. An $\ell$-read $(n,d)_q$-code is a set of $q$-ary sequences of length $n$ in which the Hamming distance between the $\ell$-read vectors of any two distinct sequences is at least $d$. We first improve the result of Banerjee \emph{et al.}, who studied $\ell$-read $(n,d)_q$-codes with the constraint $\ell\geq 3$ and $d=3$. Then, we investigate bounds and constructions of $2$-read codes with minimum distance $3$, $4$, and $5$. Our results indicate that when $d \in \{3,4\}$, the optimal redundancy of $2$-read $(n,d)_q$-codes is $o(\log_q n)$, while for $d=5$ it is $\log_q n+o(\log_q n)$. Additionally, we establish an equivalence between $2$-read $(n,3)_q$-codes and classical $q$-ary single-insertion reconstruction codes using two noisy reads. We improve the lower bound on the redundancy of classical $q$-ary single-insertion reconstruction codes as well as the upper bound on the redundancy of classical $q$-ary single-deletion reconstruction codes when using two noisy reads. Finally, we study $\ell$-read codes under the reconstruction model.
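
To make the definitions above concrete, the following is a minimal Python sketch of the $\ell$-read vector, the induced Hamming distance, and a brute-force check of the $\ell$-read $(n,d)_q$-code property. The function names (`read_vector`, `read_distance`, `is_read_code`) are hypothetical and not taken from the paper. The sketch also assumes that positions outside $1,\dots,n$ contribute nothing to a multi-set, so the first and last $\ell-1$ entries contain fewer than $\ell$ elements; the abstract does not spell out this boundary convention.

```python
from itertools import combinations


def read_vector(x, ell):
    """Return the ell-read vector of x as a list of n + ell - 1 multisets.

    Each multiset is stored as a sorted tuple so that two multisets compare
    equal exactly when they hold the same elements with the same counts.
    Assumption: positions outside 1..n are treated as empty.
    """
    n = len(x)
    entries = []
    for i in range(1, n + ell):          # i = 1, ..., n + ell - 1 (1-indexed)
        lo = max(1, i - ell + 1)         # clip the window to valid positions
        hi = min(n, i)
        entries.append(tuple(sorted(x[lo - 1:hi])))
    return entries


def read_distance(x, y, ell):
    """Hamming distance between the ell-read vectors of x and y."""
    rx, ry = read_vector(x, ell), read_vector(y, ell)
    if len(rx) != len(ry):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(rx, ry))


def is_read_code(code, ell, d):
    """Check by brute force that every pair of distinct codewords has
    ell-read distance at least d."""
    return all(read_distance(x, y, ell) >= d for x, y in combinations(code, 2))


if __name__ == "__main__":
    print(read_vector((0, 1, 1), 2))
    print(read_distance((0, 1), (1, 0), 2))
    print(is_read_code([(0, 0, 0), (1, 1, 1)], 2, 3))
```

The pairwise check is quadratic in the code size and is meant only for experimenting with small $n$ and $q$; it says nothing about the redundancy bounds or constructions studied in the paper.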
