Error-Correcting Codes for Combinatorial Composite DNA

Omer Sabary; Inbal Preuss; Ryan Gabrys; Zohar Yakhini; Leon Anavy; Eitan Yaakobi

TL;DR

The paper tackles reliable data storage in DNA under combinatorial composite synthesis by modeling read-back failures as composite-asymmetric errors in weight-constrained shortmer symbols. It develops a VT-syndrome-based construction over a prime field that yields $(t,e)$-composite asymmetric error-correcting codes (CAECCs), with an explicit encoder/decoder and sphere-packing bounds that exploit the per-row weight constraint to keep redundancy low. The authors derive both theoretical bounds and practical encoding/decoding schemes, and validate the model on real sequencing data, showing non-negligible error probabilities that justify deploying ECCs. The work further extends the model to $(t_1,t_2)$-CAECCs and $2$-CAECCs, offering constructions and bounds that remain near-optimal under realistic parameter regimes, in support of high-density DNA data storage with controlled redundancy.

Abstract

Data storage in DNA is developing as a possible solution for archival digital data. Recently, to further increase the potential capacity of DNA-based data storage systems, the combinatorial composite DNA synthesis method was suggested. This approach extends the DNA alphabet by harnessing short DNA fragment reagents, known as shortmers. The shortmers are the building blocks of the alphabet symbols, each of which consists of a fixed number of shortmers. Thus, when information is read, it is possible that one of the shortmers that forms part of the composition of a symbol is missing, and therefore the symbol cannot be determined. In this paper, we model this type of error as a type of asymmetric error and propose code constructions that can correct such errors in this setup. We also provide a lower bound on the redundancy of such error-correcting codes and give an explicit encoder and decoder pair for our construction. Our suggested error model is also supported by an analysis of data from actual experiments that produced DNA according to the combinatorial scheme. Lastly, we provide a statistical evaluation of the probability of observing such error events as a function of read depth.
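To make the error model concrete, the following minimal sketch treats a combinatorial symbol as a set of $w$ distinct shortmers drawn from a pool and counts how many symbols remain consistent with an incomplete observation. The pool size `POOL_SIZE`, the weight `W`, and all function names are hypothetical choices for illustration. They are not values or interfaces taken from the paper.

```python
from itertools import combinations
from math import comb

POOL_SIZE = 6  # hypothetical number of available shortmers (assumption)
W = 3          # hypothetical number of shortmers per symbol (assumption)

def num_missing(observed, w=W):
    """Number of asymmetric errors: shortmers of the symbol that were never read."""
    return w - len(set(observed))

def consistent_symbols(observed, pool_size=POOL_SIZE, w=W):
    """List every w-subset of the pool that contains all observed shortmers."""
    observed = frozenset(observed)
    if len(observed) > w:
        raise ValueError("observed more shortmers than a symbol can hold")
    unseen = [s for s in range(pool_size) if s not in observed]
    return [observed | frozenset(extra)
            for extra in combinations(unseen, w - len(observed))]

def num_consistent(observed, pool_size=POOL_SIZE, w=W):
    """Closed-form count of the same candidates: choose the missing shortmers."""
    k = len(set(observed))
    return comb(pool_size - k, w - k)

full_read = {0, 2, 5}     # all three shortmers were sampled
partial_read = {0, 5}     # shortmer 2 was never sampled: one asymmetric error

print(len(consistent_symbols(full_read)))     # 1
print(len(consistent_symbols(partial_read)))  # 4
```

With a complete read exactly one symbol fits, while a single missing shortmer already leaves several candidates. That ambiguity is what an error-correcting code must resolve.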

Paper Structure (13 sections, 9 theorems, 17 equations, 2 figures, 1 table, 1 algorithm)

Introduction
Definitions and Problem Statement
Notations
Problem Statement
Code Constructions
Bounds on the Size of Composite Asymmetric Error-Correcting Codes
Explicit Encoder and Decoder
Simulation and Statistics
Statistics on real data
Evaluation of error probability
Extensions of the asymmetric error model
$(t_1,t_2)$-CAECCs
$2$-CAECC

Key Result

Theorem 1

The code $\mathcal{C}_{m, n, w}^{(t,e)}$ is a $(t,e)$-CAECC.

Figures (2)

Figure 1: Asymmetric combinatorial errors in experimental results. The x-axis represents the average number of reads per strand, sampled from actual NGS data. The y-axis shows the number of observed $s_{i}$. Midpoints represent the mean count of observed $s_{i}$, and the whiskers represent the standard deviation over 10 repeated samplings, aggregated over the different strands in each experiment.
Figure 2: Probability of observing $e$ or more asymmetric errors in a single combinatorial symbol. The x-axis gives the threshold $e$ (that is, $e$ or more errors), each line represents a different number of analyzed reads ($R$), and the y-axis shows the error probability. Results for $w=5$, $R=1,5,10,20,25$, $e=0,1,\ldots,4$.
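The quantity plotted in Figure 2 can be approximated with a small calculation. The sketch below assumes that each read independently reveals one of the symbol's $w$ shortmers uniformly at random. This is a simplifying assumption made here for illustration, and the paper's own evaluation may use a different sampling model. The function names are hypothetical.

```python
def distinct_count_distribution(w, reads):
    """dist[k] = P(exactly k distinct shortmers have been seen after `reads` reads)."""
    dist = [0.0] * (w + 1)
    dist[0] = 1.0  # before any read, nothing has been observed
    for _ in range(reads):
        nxt = [0.0] * (w + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            nxt[k] += p * k / w                # read repeats an already-seen shortmer
            if k < w:
                nxt[k + 1] += p * (w - k) / w  # read reveals a new shortmer
        dist = nxt
    return dist

def prob_at_least_errors(w, reads, e):
    """P(at least e of the w shortmers are still missing after `reads` reads)."""
    dist = distinct_count_distribution(w, reads)
    return sum(p for k, p in enumerate(dist) if w - k >= e)

# Same parameter grid as the Figure 2 caption: w = 5, R in {1, 5, 10, 20, 25}, e = 0..4.
for reads in (1, 5, 10, 20, 25):
    row = [round(prob_at_least_errors(5, reads, e), 4) for e in range(5)]
    print(reads, row)
```

Under this uniform model at least one shortmer is always observed once $R \ge 1$, so the number of missing shortmers never exceeds $w-1$, which matches the range $e=0,\ldots,4$ used for $w=5$.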

Theorems & Definitions (21)

Example 1
Definition 1
Definition 2
Theorem 1
proof
Corollary 1
proof
Definition 3
Lemma 1
proof
...and 11 more
