Unrestricted Error-Type Codebook Generation for Error Correction Code in DNA Storage Inspired by NLP

Yi Lu, Yun Ma, Chenghao Li, Xin Zhang, Guangxiang Si

TL;DR

This paper tackles error correction for DNA storage by modeling editing errors as an IDS channel and introducing an NLP-inspired, bottom-up codebook generation framework. The core method centers on the Edit Computational Graph (ECG) and Derivative-Free Optimization (DFO) to iteratively grow a codebook that can correct substitutions, insertions, and deletions without reliance on a predefined error type. Key contributions include a formal ECG-based mechanism for feasible edit counting, a bit-encoded representation of codebook entries, and Monte Carlo-based codebook construction that achieves redundancy reductions relative to baseline deletion/substitution codes. The approach reframes codebook generation as a spell-correction-like task, offering potential improvements in encoding/decoding efficiency for DNA storage while acknowledging remaining challenges with burst errors and future avenues for enhanced robustness.

Abstract

Recently, DNA storage has emerged as a promising alternative for data storage, offering notable benefits in storage capacity, low maintenance cost, and the capability for parallel replication. Mathematically, the DNA storage process can be modeled as an insertion, deletion, and substitution (IDS) channel. Because of the mathematical complexity of the Levenshtein distance, designing a code that corrects IDS errors remains a challenging task. In this paper, we propose a bottom-up generation approach that grows the required codebook based on the computation of an Edit Computational Graph (ECG); it differs from algebraic constructions by incorporating the Derivative-Free Optimization (DFO) method. In particular, the approach does not depend on the type of errors. Compared with existing work on 1-substitution-1-deletion and 2-deletion codes, the redundancy is reduced by about 30 bits and 60 bits, respectively. As far as we know, our method is the first IDS-correcting code designed using classical Natural Language Processing (NLP) techniques, marking a turning point in the field of error correction code research. Based on the codebook generated by our method, there may be significant breakthroughs in the complexity of encoding and decoding algorithms.
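
To make the IDS setting concrete, the following is a minimal sketch of bottom-up codebook growth under the Levenshtein distance. It is not the ECG/DFO procedure of the paper: it swaps in a plain random (Monte Carlo) candidate search and relies on the standard sufficient condition that codewords at pairwise edit distance greater than $2t$ have disjoint radius-$t$ edit balls. The names `levenshtein` and `grow_codebook` and the parameters `length`, `t`, `trials`, and `seed` are assumptions made for illustration.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic program for the edit (Levenshtein) distance."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def grow_codebook(length: int, t: int, trials: int, seed: int = 0) -> list[str]:
    """Greedily keep random DNA words whose distance to every kept word exceeds 2t.

    Illustrative stand-in only: the paper grows its codebook with ECG and DFO,
    not with this naive rejection sampling.
    """
    rng = random.Random(seed)
    codebook: list[str] = []
    for _ in range(trials):
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if all(levenshtein(cand, c) > 2 * t for c in codebook):
            codebook.append(cand)
    return codebook


if __name__ == "__main__":
    book = grow_codebook(length=8, t=1, trials=200)
    print(len(book), book[:3])
```

Because the distance check does not distinguish insertions, deletions, and substitutions, a codebook built this way corrects any mix of up to `t` edits, which is the sense in which the approach is independent of the error type. The quadratic cost of each comparison is what a more careful graph-based computation would aim to reduce.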

Paper Structure

This paper contains 14 sections, 2 theorems, 21 equations, 6 figures, 2 tables, 3 algorithms.

Key Result

Proposition 2.1

Let $C$ be a codebook, and let $r$ and $d$ be as above and satisfy Condition eq:intersectEmptyCond. Then,

Figures (6)

  • Figure 1: Workflow of codebook generation
  • Figure 2: State transitions on different edges. (a) represents the case of $s_1[i] \neq s_2[j]$ and (b) shows the case of $s_1[i] = s_2[j]$.
  • Figure 3: Sequence alignment procedure in ECG at step $k$. If $k<q$, we just start from the index $0$.
  • Figure 4: Structure of the dynamic programming matrix with size $2\times (2q+1)$. Each element of the matrix is a bit array for a vertex in the graph.
  • Figure 5: The flow chart of ECG iteration process.
  • ...and 1 more figure
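
The $2\times (2q+1)$ matrix in Figure 4 suggests a banded alignment that keeps only two rolling rows of width $2q+1$ around the diagonal. The sketch below shows that idea for a plain edit-distance threshold test. It is an assumption-laden illustration, not the ECG itself: the cells here hold integers rather than the per-vertex bit arrays the caption describes, and the function name `banded_edit_distance` is hypothetical.

```python
def banded_edit_distance(s1: str, s2: str, q: int):
    """Return the Levenshtein distance of s1 and s2 if it is at most q, else None.

    Only two rows of 2q+1 cells are stored; cell offset o in row i stands for
    column j = i - q + o, so the band covers |i - j| <= q.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > q:
        return None  # the length gap alone already needs more than q edits
    inf = q + 1      # every value above q is treated as "too far"
    width = 2 * q + 1

    prev = [inf] * width
    for o in range(width):  # row i = 0: distance from the empty prefix
        j = o - q
        if 0 <= j <= m:
            prev[o] = j

    for i in range(1, n + 1):
        curr = [inf] * width
        for o in range(width):
            j = i - q + o
            if j < 0 or j > m:
                continue  # cell lies outside the real matrix
            if j == 0:
                best = i  # delete the first i symbols of s1
            else:
                # diagonal neighbour (i-1, j-1) has the same offset o
                best = prev[o] + (s1[i - 1] != s2[j - 1])
                if o > 0:
                    best = min(best, curr[o - 1] + 1)  # left neighbour (i, j-1)
            if o + 1 < width:
                best = min(best, prev[o + 1] + 1)      # upper neighbour (i-1, j)
            curr[o] = min(best, inf)
        prev = curr

    d = prev[m - n + q]
    return d if d <= q else None


if __name__ == "__main__":
    print(banded_edit_distance("ACGT", "AGT", 1))
    print(banded_edit_distance("kitten", "sitting", 2))
```

Restricting the computation to the band lowers the work per comparison from the order of $nm$ cells to the order of $n(2q+1)$ cells, which matters when the same threshold test is repeated for many candidate codewords.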

Theorems & Definitions (15)

  • Definition 2.1
  • Definition 2.2
  • Proposition 2.1
  • Example 2.1
  • Definition 2.3
  • Definition 2.4
  • Example 2.2
  • Proposition 2.2
  • Definition 3.1
  • Example 3.1
  • ...and 5 more