Error-Correcting Codes for Combinatorial Composite DNA

Omer Sabary, Inbal Preuss, Ryan Gabrys, Zohar Yakhini, Leon Anavy, Eitan Yaakobi

TL;DR

The paper tackles reliable data storage in DNA using combinatorial composite synthesis by modeling read-back as composite-asymmetric errors in weight-constrained shortmer symbols. It develops a VT-syndrome–based construction over a prime field to build $(t,e)$-composite asymmetric ECCs (CAECCs), with an explicit encoder/decoder and sphere-packing bounds that leverage the per-row weight constraint to achieve favorable redundancy. The authors derive both theoretical bounds and practical encoding/decoding schemes, and validate the model with real-sequencing data, showing non-negligible error probabilities that justify ECC deployment. The work further extends the model to $(t_1,t_2)$-CAECCs and $2$-CAECCs, offering constructive approaches and bounds that remain near-optimal under realistic parameter regimes, thereby enabling scalable, high-density DNA data storage with controlled redundancy.

Abstract

Data storage in DNA is emerging as a possible solution for archival digital data. Recently, to further increase the potential capacity of DNA-based data storage systems, the combinatorial composite DNA synthesis method was suggested. This approach extends the DNA alphabet by harnessing short DNA fragment reagents, known as shortmers. The shortmers are the building blocks of the alphabet symbols, each of which consists of a fixed number of shortmers. Thus, when information is read, one of the shortmers that make up a symbol may be missing, in which case the symbol cannot be determined. In this paper, we model this as a type of asymmetric error and propose code constructions that can correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. Our suggested error model is also supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Lastly, we provide a statistical evaluation of the probability of observing such error events as a function of read depth.
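
To make the error model concrete, the following minimal Python sketch shows why a missing shortmer leaves a symbol undetermined. It assumes a symbol can be represented as a set of $w$ distinct shortmers chosen from $n$ available ones, and that reads only ever reveal shortmers that truly belong to the symbol (the asymmetric part of the model). The function name `candidate_symbols` and the parameter values are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def candidate_symbols(observed, n, w):
    """Return every weight-w symbol over shortmers 0..n-1 that contains
    all observed shortmers (hypothetical helper, for illustration only)."""
    observed = frozenset(observed)
    if len(observed) > w:
        raise ValueError("more shortmers observed than a symbol can hold")
    # Shortmers that were not observed but could still belong to the symbol.
    rest = [s for s in range(n) if s not in observed]
    missing = w - len(observed)
    return [observed | frozenset(extra) for extra in combinations(rest, missing)]


# Illustrative parameters (assumptions): n = 6 shortmers, w = 3 per symbol.
symbol = frozenset({0, 2, 5})

# All w shortmers observed: the symbol is determined uniquely.
full = candidate_symbols(symbol, n=6, w=3)

# One shortmer (5) never appears in the reads: several symbols remain possible.
partial = candidate_symbols({0, 2}, n=6, w=3)

print(len(full), len(partial))
```

With one shortmer missing, every unobserved shortmer is a possible completion, so the reader cannot identify the symbol without help from an error-correcting code.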

Paper Structure

This paper contains 13 sections, 9 theorems, 17 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The code $\mathcal{C}_{m, n, w}^{(t,e)}$ is a $(t,e)$-CAECC.

Figures (2)

  • Figure 1: Asymmetric combinatorial errors in experimental results. The x-axis represents the average number of reads per strand, sampled from actual NGS data. The y-axis shows the number of observed $s_{i}$. Midpoints represent the mean count of observed $s_{i}$, and the whiskers represent the standard deviation over 10 repeated samplings, aggregated over the different strands in each experiment.
  • Figure 2: Probability of observing $e$ or more asymmetric errors in a single combinatorial symbol. The x-axis indicates the threshold $e$ (that is, $e$ or more errors), each line represents a different number of analyzed reads ($R$), and the y-axis shows the error probability. Results are shown for $w=5$, $R \in \{1,5,10,20,25\}$, and $e=0,1,\ldots,4$.
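
The quantity plotted in Figure 2 can be explored with a short Python sketch under a simplifying assumption that is not confirmed by the text above: each of the $R$ reads reveals one of the symbol's $w$ shortmers uniformly and independently, with no other error sources. Under that assumption, the probability that exactly $j$ shortmers are never seen follows from counting surjections by inclusion-exclusion. The function names are hypothetical, and the resulting values are not claimed to reproduce the paper's figure.

```python
from fractions import Fraction
from math import comb


def surjections(R, k):
    """Number of ways R reads can cover all k shortmers (inclusion-exclusion)."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** R for i in range(k + 1))


def prob_exactly_missing(w, R, j):
    """P(exactly j of the w shortmers are unseen after R uniform reads)."""
    # Choose which j shortmers are missing; the reads must cover the other w - j.
    return Fraction(comb(w, j) * surjections(R, w - j), w ** R)


def prob_at_least_missing(w, R, e):
    """P(e or more shortmers are unseen), i.e. e or more asymmetric errors."""
    return sum(prob_exactly_missing(w, R, j) for j in range(e, w + 1))


# Parameter grid taken from the Figure 2 caption: w = 5, R in {1, 5, 10, 20, 25}.
w = 5
for R in (1, 5, 10, 20, 25):
    row = [float(prob_at_least_missing(w, R, e)) for e in range(0, 5)]
    print(R, row)
```

Exact rational arithmetic via `Fraction` avoids the cancellation errors that the alternating inclusion-exclusion sum can cause in floating point.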

Theorems & Definitions (21)

  • Example 1
  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Definition 3
  • Lemma 1
  • proof
  • ...and 11 more