Unrestricted Error-Type Codebook Generation for Error Correction Code in DNA Storage Inspired by NLP
Yi Lu, Yun Ma, Chenghao Li, Xin Zhang, Guangxiang Si
TL;DR
This paper tackles error correction for DNA storage by modeling editing errors as an IDS channel and introducing an NLP-inspired, bottom-up codebook generation framework. The core method centers on the Edit Computational Graph (ECG) and Derivative-Free Optimization (DFO) to iteratively grow a codebook that can correct substitutions, insertions, and deletions without reliance on a predefined error type. Key contributions include a formal ECG-based mechanism for feasible edit counting, a bit-encoded representation of codebook entries, and Monte Carlo-based codebook construction that achieves redundancy reductions relative to baseline deletion/substitution codes. The approach reframes codebook generation as a spell-correction-like task, offering potential improvements in encoding/decoding efficiency for DNA storage while acknowledging remaining challenges with burst errors and future avenues for enhanced robustness.
Abstract
Recently, DNA storage has emerged as a promising alternative for data storage, offering notable benefits in storage capacity, maintenance cost, and the capability for parallel replication. Mathematically, the DNA storage process can be modeled as an insertion, deletion, and substitution (IDS) channel. Due to the mathematical complexity associated with the Levenshtein distance, designing a code that corrects IDS errors remains a challenging task. In this paper, we propose a bottom-up generation approach that grows the required codebook based on the computation of the Edit Computational Graph (ECG). The approach differs from algebraic constructions by incorporating the Derivative-Free Optimization (DFO) method. Notably, it does not depend on the type of error. Compared with existing codes for 1-substitution-1-deletion and 2-deletion errors, the redundancy is reduced by about 30 bits and 60 bits, respectively. As far as we know, our method is the first IDS-correcting code designed using classical Natural Language Processing (NLP) techniques, marking a turning point in the field of error correction code research. Based on the codebook generated by our method, there may be significant breakthroughs in the complexity of encoding and decoding algorithms.
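To make the role of the Levenshtein distance concrete, the following is a minimal Python sketch of growing a codebook bottom-up by random sampling. It is not the ECG/DFO procedure of this paper: it swaps in a plain dynamic-programming edit distance and a greedy accept/reject rule, and the names `levenshtein`, `grow_codebook`, `min_distance`, and `trials` are illustrative assumptions. It relies on the standard fact that a code can uniquely correct up to t edits of any type when every pair of codewords is at Levenshtein distance at least 2t + 1.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def grow_codebook(length: int, min_distance: int, trials: int, seed: int = 0) -> list[str]:
    """Greedy random growth (an assumption, not the paper's method):
    keep a random candidate only if it is far enough from every codeword so far."""
    rng = random.Random(seed)
    codebook: list[str] = []
    for _ in range(trials):
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if all(levenshtein(cand, w) >= min_distance for w in codebook):
            codebook.append(cand)
    return codebook


# Distance 3 = 2*1 + 1, so this codebook tolerates one edit of any type.
codebook = grow_codebook(length=8, min_distance=3, trials=500)
```

The pairwise check costs time quadratic in the codeword length for each comparison and grows with the codebook size, which is one reason a sketch like this only scales to short codewords.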
