Codes for Limited-Magnitude Probability Error in DNA Storage

Wenkai Zhang; Zhiying Wang

TL;DR

This work defines a limited-magnitude probability error (LMPE) channel for composite DNA letters, where each symbol is a probability vector of four nucleotides with fixed resolution $k$ and at most $t$ symbol errors of magnitude at most $l$. It develops a two-layer coding framework that first protects symbol classes (remainder/quotient structure) and then recovers actual probability vectors, and introduces multiple explicit constructions (remainder classes, reduced classes, improved Hamming, BCH-based schemes) with asymptotic optimality proven for the remainder-class approach. The bounds section provides sphere-packing and Gilbert–Varshamov results that guide code-size and rate tradeoffs as $n$ grows, $k$ grows large, and $l,t$ vary, while the systematic LMPE codes with Gray mapping enhance practical deployment. Collectively, the paper delivers concrete, scalable error-correcting schemes for DNA storage using composite letters, balancing redundancy, complexity, and implementation practicality, with clear paths to higher rates via asymptotic optimization and systematic designs.

Abstract

DNA, with its remarkable density, durability, and replicability, is one of the most appealing storage media. Emerging DNA storage technologies use composite DNA letters, where information is represented by probability vectors, leading to higher information density and lower synthesis costs than regular DNA letters. However, composite DNA storage is subject to inevitable noise and information corruption. This paper explores the channel of composite DNA letters in DNA-based storage systems and introduces block codes for limited-magnitude probability errors on probability vectors. First, outer and inner bounds for limited-magnitude probability error correction codes are provided. Moreover, code constructions are proposed where the number of errors is bounded by $t$, the error magnitudes are bounded by $l$, and the probability resolution is fixed as $k$. These constructions exploit the properties of limited-magnitude probability errors in DNA-based storage systems, leading to improved performance in terms of complexity and redundancy. In addition, the asymptotic optimality of one of the proposed constructions is established. Finally, systematic codes based on one of the proposed constructions are presented, which enable efficient information extraction for practical implementation.
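To make the symbol alphabet concrete, the following minimal sketch enumerates composite letters at resolution $k$, under the assumption that a letter is stored as four non-negative integer counts $(x_1,x_2,x_3,x_4)$ summing to $k$ (so the probabilities are $x_i/k$). The function names `composite_letters` and `alphabet_size` are illustrative and do not come from the paper.

```python
from itertools import product
from math import comb


def composite_letters(k):
    """Yield every (x1, x2, x3, x4) of non-negative integers with sum k.

    Assumption: a composite letter is represented by integer counts,
    so that x_i / k is the probability of the i-th nucleotide.
    """
    for a, c, g in product(range(k + 1), repeat=3):
        t = k - a - c - g  # the fourth count is fixed by the sum constraint
        if t >= 0:
            yield (a, c, g, t)


def alphabet_size(k):
    """Closed-form count of such vectors (stars and bars): C(k + 3, 3)."""
    return comb(k + 3, 3)


k = 12  # resolution used in the Figure 3 example
letters = list(composite_letters(k))
print(len(letters), alphabet_size(k))
```

Because the fourth count is determined by the other three, each letter has only three free coordinates, which is consistent with the field size $q=(2l+1)^3$ that appears in the figure captions.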

Paper Structure

This paper contains 16 sections, 19 theorems, 78 equations, 7 figures, 6 tables, and 1 algorithm.

Introduction
Problem statement
Bounds on LMPE correction codes
Constructions
Framework and example
Remainder class codes
Reduced class codes
Codes based on the improved Hamming code
Comparison
Asymptotic optimality
Systematic LMPE codes
Conclusion
Sphere-packing bound
Gilbert-Varshamov bound
Improved Hamming codes
...and 1 more section

Key Result

Lemma 1

$\mathbf{e}=(e_1,e_2,e_3,e_4)$ is an $l$-limited-magnitude probability error for one symbol, if and only if

Figures (7)

Figure 1: The illustration of composite DNA-based storage.
Figure 2: Bounds on LMPE correction code with $n= 1023$, $k=100$, $l=10$ or $20$. The horizontal axis is the number of erroneous symbols $t$, and the vertical axis is the code rate. SPB and GVB represent sphere-packing (upper) bound and Gilbert-Varshamov (lower) bound, respectively.
Figure 3: An $(l=1,t=1)$ LMPE correction code with $n=28$, resolution $k=12$, based on the $27$-ary $(28,26)$ Hamming code. This figure demonstrates the encoding, corruption, and decoding of only one example codeword. Assume $(0,0,0,0)$ is mapped to $0$ in $GF(27)$, and $(2,1,0,0)$ is mapped to $17$ in $GF(27)$. The gray color denotes the parities. In Rows (1) and (2), all $28$ quotient vectors and the first $26$ remainder vectors are mapped from the information messages. In Row (3), we map the first 26 remainder vectors to elements in $GF(27)$. In Row (4) we use the $27$-ary $(28,26)$ Hamming code to generate two $0$'s in $GF(27)$ as parities (last two symbols), which are mapped back to remainder vectors $(0,0,0,0)$ in Row (5). We combine the remainder vectors and the quotient vectors to form transmitted probability vectors in Row (6). Row (7) denotes the corrupted word with a limited-magnitude probability error in the $2$nd symbol. In Row (8), we get the remainders through dividing by $3$. In Row (9), remainder vectors are mapped to elements in $GF(27)$. Then we use the $27$-ary $(28,26)$ Hamming code to decode in Row (10), and the corresponding correct remainder vectors are in Row (11). Based on the received probability words in Row (7) and the correct remainder vectors in Row (11), we form the corrected codewords in Row (12).
Figure 4: Encoding procedure for a systematic code. The Gray codeword consists of $g=2$ digits. (a) The information symbols are divided into two parts: remainder vector and quotient vector by the modulo operation. (b) The parities over the finite field are generated based on remainder vector, and every $g=2$ finite field digits are placed in one column. (c) The information symbols are formed by the modulo operation, and the parity symbols are formed based on the Gray code.
Figure 5: Gray mapping efficiency. For fixed $g,l$ and $q=(2l+1)^3$, the smallest $k$ found by Algorithm 1 is shown as the first number in the parentheses. The second number in the parentheses is the efficiency.
...and 2 more figures
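
The quotient/remainder step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes every component of a symbol error has absolute value at most $l$, and that the correct remainder vector has already been recovered by the outer code (the $27$-ary Hamming code in the figure). The helper names and the sample symbol $(5,4,3,0)$ are hypothetical, and the mapping between remainder vectors and elements of $GF(27)$ is deliberately left out.

```python
def split_symbol(x, l):
    """Split integer counts x into quotient and remainder vectors mod 2l + 1."""
    m = 2 * l + 1
    quotient = tuple(v // m for v in x)
    remainder = tuple(v % m for v in x)
    return quotient, remainder


def correct_symbol(y, remainder, l):
    """Recover the sent symbol from received y and the correct remainder.

    Assumption: each component error e_i satisfies |e_i| <= l, so e_i is
    the unique value in [-l, l] congruent to y_i - r_i modulo 2l + 1.
    """
    m = 2 * l + 1
    error = tuple(((v - r + l) % m) - l for v, r in zip(y, remainder))
    return tuple(v - e for v, e in zip(y, error))


l = 1                       # error magnitude bound, so the modulus is 3
x = (5, 4, 3, 0)            # hypothetical sent symbol, resolution k = 12
q, r = split_symbol(x, l)   # q = (1, 1, 1, 0), r = (2, 1, 0, 0)
y = (6, 3, 3, 0)            # received symbol: error (+1, -1, 0, 0)
x_hat = correct_symbol(y, r, l)
print(q, r, x_hat)
```

Only the remainder vectors need protection by the outer code in this sketch: once the remainder of a symbol is known, a bounded error is located by a centered reduction modulo $2l+1$, and the quotient follows from the corrected symbol.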

Theorems & Definitions (37)

Definition 1: Limited-magnitude probability error (LMPE)
Lemma 1
Definition 2: LMPE correction code
Definition 3: Geodesic distance
Remark 1
Theorem 1
Theorem 2
Theorem 3
Remark 2
Definition 4: Quotient vector and remainder vector
...and 27 more
