Levenshtein's Sequence Reconstruction Problem and Results for Larger Alphabet Sizes

Ville Junnila, Tero Laihonen, Tuomo Lehtilä

TL;DR

This work extends Levenshtein's sequence reconstruction problem from binary to $q$-ary alphabets, motivated by polymer-based storage such as DNA where $q>2$ is natural. It derives precise list-size bounds for erasures, substitutions, and mixed errors, including a tight bound for constant $q$-ary list size under substitution-only errors and a decoding algorithm based on a $q$-ary majority framework. The results reveal distinct behaviors of error types as $q$ grows, notably that insertions can yield unique reconstructions with high probability while substitutions do not, and deletions/erasures exhibit different large-$q$ limits. These findings have direct implications for designing robust DNA/polymer memories, where larger alphabets and mixed error profiles must be accounted for in reconstruction and decoding strategies.
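
To give a feel for what a $q$-ary majority framework means, here is a minimal Python sketch of a positionwise plurality vote over several channel outputs. It is not the authors' decoding algorithm, which is not reproduced on this page. The function name `plurality_decode`, the `erasure` marker, and the tie-breaking rule (smallest symbol wins) are all assumptions made for illustration.

```python
from collections import Counter


def plurality_decode(outputs, erasure=None):
    """Positionwise plurality vote over equal-length channel outputs.

    Illustrative sketch only: `erasure` marks an erased symbol, and ties
    are broken by taking the smallest symbol (an arbitrary assumption).
    """
    if not outputs:
        raise ValueError("need at least one channel output")
    n = len(outputs[0])
    if any(len(y) != n for y in outputs):
        raise ValueError("all outputs must have the same length")

    decoded = []
    for i in range(n):
        # Count the non-erased symbols observed at position i.
        counts = Counter(y[i] for y in outputs if y[i] != erasure)
        if not counts:
            # Every channel erased this position; nothing to vote on.
            decoded.append(erasure)
            continue
        best = max(counts.values())
        # Deterministic tie-break: smallest symbol among the most frequent.
        decoded.append(min(s for s, c in counts.items() if c == best))
    return decoded


# Three quaternary outputs (q = 4) of the same transmitted word.
outputs = [[0, 1, 2, 3], [0, 1, 2, 0], [0, 3, 2, 3]]
print(plurality_decode(outputs))
```

With binary symbols this reduces to ordinary majority voting. For $q>2$ a position can have no strict majority at all, which is one reason the $q$-ary case needs more care than this sketch shows.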

Abstract

The problem of storing large amounts of information safely for a long period of time has become essential. Polymer-based data storage systems, such as DNA storage, are among the most promising new data storage media. These storage systems are highly durable and consume very little energy to store the data. When information is retrieved from storage, however, several different types of errors may occur in the process. Levenshtein's sequence reconstruction framework is known to be well suited to overcoming such errors and retrieving the original information. Many of the previous results on Levenshtein's sequence reconstruction method have so far been given only for the binary alphabet. However, larger alphabets are natural for polymer-based data storage. For example, the quaternary alphabet is suitable for DNA storage because DNA is built from four nucleotide bases. The results for larger alphabets often require, as we will see in this work, different and more complicated techniques than the binary case. Moreover, we show that an increase in the alphabet size makes some error types behave rather surprisingly.

Paper Structure

This paper contains 7 sections, 22 theorems, 55 equations, 2 figures, 2 algorithms.

Key Result

Theorem 1

Let $N\geq N_{t,e}$ and $C\subseteq \mathbb{Z}_q^n$ be an $e$-error-correcting code. Then $\mathcal{L}=1$ if … For $e=0$ there exists a simplified version of the bound above …
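
The role of $\mathcal{L}$ can be illustrated with a small brute-force sketch. It assumes the usual reading of the reconstruction setting: a codeword is sent through $N$ channels, each introducing at most $t$ substitutions, and the receiver lists every codeword within Hamming distance $t$ of all outputs. Under this reading, $\mathcal{L}$ is the largest such list size over all admissible output sets. The helper names `hamming` and `candidate_list` and the toy ternary repetition code are hypothetical and not taken from the paper.

```python
def hamming(u, v):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def candidate_list(code, outputs, t):
    """Codewords within Hamming distance t of every channel output."""
    return [c for c in code if all(hamming(c, y) <= t for y in outputs)]


# Toy example (assumption): ternary repetition code of length 3, which has
# minimum distance 3 and is therefore 1-error-correcting.
code = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
t = 2  # each channel may substitute up to 2 symbols

one_output = [(0, 0, 1)]
two_outputs = [(0, 0, 1), (0, 0, 2)]

print(candidate_list(code, one_output, t))   # two codewords remain
print(candidate_list(code, two_outputs, t))  # only (0, 0, 0) remains
```

In this toy case a single output leaves two candidates, while a second distinct output narrows the list to one codeword. The exhaustive search is only practical for very small codes.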

Figures (2)

  • Figure 1: Levenshtein's sequence reconstruction.
  • Figure 2: Transmitted word candidates $\mathbf{x}$ and $\mathbf{x}'$ at distance $d$ from each other together with output word $\mathbf{y}$ and variables $i_1,i_2,i_3,i_e$ and $j_e$ as in the proof of Theorem \ref{ThmEraSubExactMinDist}. Note that due to the symmetries of Hamming space, we may assume without loss of generality that $\mathbf{x}=\mathbf{0}$ and $\mathbf{x}'$ contains only $0$'s and $1$'s.

Theorems & Definitions (42)

  • Theorem 1: Levenshtein
  • Theorem 2
  • proof
  • Lemma 3: Corollary 1 of Levenshtein
  • Lemma 4
  • proof
  • Theorem 5
  • proof
  • Theorem 6
  • Lemma 7
  • ...and 32 more