Coding methods for string reconstruction from erroneous prefix-suffix compositions
Zitan Chen
TL;DR
This work addresses robust reconstruction of binary strings from their prefix-suffix compositions in the presence of composition errors, a model motivated by polymer-based storage. It develops a framework based on generalized Reed-Solomon codes that guarantees polynomial-time decoding from up to $t$ composition errors, and presents a constant-rate construction that corrects $t=\Theta(n)$ errors for single-string recovery. It also gives a method for reconstructing $h$ arbitrary strings by jointly encoding them so that the multiset union of their error-free prefix-suffix compositions enables recovery at rate $1/(h+1)$, and extends this method to the erroneous setting by combining it with asymptotically good binary codes, achieving constant-rate, efficient recovery for $t=\Theta(n)$. The results advance practical data retrieval from erroneous prefix-suffix information and offer multiple trade-offs between redundancy, rate, and error-correction capability.
Abstract
The number of zeros and the number of ones in a binary string are referred to as the composition of the string, and the prefix-suffix compositions of a string are the multiset formed by the compositions of its prefixes and suffixes of all possible lengths. In this work, we present binary codes of length $n$ in which every codeword can be efficiently reconstructed from its erroneous prefix-suffix compositions with at most $t$ composition errors. All our constructions have decoding complexity polynomial in $n$, and the best of our constructions has constant rate and can correct $t = \Theta(n)$ errors. By comparison, no prior construction can efficiently correct $t = \Theta(n)$ arbitrary composition errors. Additionally, we propose a method of encoding $h$ arbitrary strings of the same length so that they can be reconstructed from the multiset union of their error-free prefix-suffix compositions, at the expense of $h$-fold coding overhead. In contrast, existing methods can only recover $h$ distinct strings, albeit with code rate asymptotically equal to $1/h$. Building on the proposed method, we also present a coding scheme that enables efficient recovery of $h$ strings from their erroneous prefix-suffix compositions with $t = \Theta(n)$ errors.
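
To make the definition of prefix-suffix compositions concrete, the following is a minimal Python sketch that computes the multiset for a single binary string. The function names `composition` and `prefix_suffix_compositions` are illustrative and do not come from the paper. The sketch also assumes that "all possible lengths" means lengths $1$ through $n$, and that a composition is represented as the pair (number of zeros, number of ones).

```python
from collections import Counter

def composition(s: str) -> tuple[int, int]:
    """Return (number of zeros, number of ones) of the binary string s."""
    return (s.count("0"), s.count("1"))

def prefix_suffix_compositions(s: str) -> Counter:
    """Multiset of compositions of all prefixes and suffixes of s.

    Assumption: lengths range over 1..n, so the multiset has 2n elements.
    """
    n = len(s)
    multiset = Counter()
    for length in range(1, n + 1):
        multiset[composition(s[:length])] += 1       # prefix of this length
        multiset[composition(s[n - length:])] += 1   # suffix of this length
    return multiset

# Example: the string 0010 has prefixes 0, 00, 001, 0010
# and suffixes 0, 10, 010, 0010.
example = prefix_suffix_compositions("0010")
# example == Counter({(1, 0): 2, (2, 1): 2, (3, 1): 2, (2, 0): 1, (1, 1): 1})
```

In this sketch, a string and its reversal always produce the same multiset (for instance `0010` and `0100`), because the prefixes of one are the reversed suffixes of the other. For $h$ strings, the multiset union mentioned in the abstract corresponds to summing the individual `Counter` objects.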
