Levenshtein's Sequence Reconstruction Problem and Results for Larger Alphabet Sizes
Ville Junnila, Tero Laihonen, Tuomo Lehtilä
TL;DR
This work extends Levenshtein's sequence reconstruction problem from binary to $q$-ary alphabets, motivated by polymer-based storage such as DNA where $q>2$ is natural. It derives precise list-size bounds for erasures, substitutions, and mixed errors, including a tight bound for constant $q$-ary list size under substitution-only errors and a decoding algorithm based on a $q$-ary majority framework. The results reveal distinct behaviors of error types as $q$ grows, notably that insertions can yield unique reconstructions with high probability while substitutions do not, and deletions/erasures exhibit different large-$q$ limits. These findings have direct implications for designing robust DNA/polymer memories, where larger alphabets and mixed error profiles must be accounted for in reconstruction and decoding strategies.
Abstract
The problem of storing large amounts of information safely for a long period of time has become essential. Among the most promising new data storage media are polymer-based data storage systems, such as DNA storage. These storage systems are highly durable and consume very little energy to store the data. When information is retrieved from storage, however, several different types of errors may occur in the process. It is known that Levenshtein's sequence reconstruction framework is well suited to overcoming such errors and retrieving the original information. Many of the previous results on Levenshtein's sequence reconstruction method have so far been given only for the binary alphabet. However, larger alphabets are natural for polymer-based data storage. For example, the quaternary alphabet is suitable for DNA storage because DNA is built from four nucleotides. As we will see in this work, the results for larger alphabets often require different and more complicated techniques than the binary case. Moreover, we show that an increase in the alphabet size makes some error types behave rather surprisingly.
