Coding for Composite DNA to Correct Substitutions, Strand Losses, and Deletions

Frederik Walter; Omer Sabary; Antonia Wachter-Zeh; Eitan Yaakobi

TL;DR

The paper develops a binary-model framework for composite DNA data storage, where composite symbols arise from mixtures of bases to enlarge the alphabet. It establishes non-asymptotic bounds and explicit constructions for correcting substitutions, strand losses, and deletions, by proving equivalences to classical metric spaces: $L_1$ for substitutions and $L_\infty$ for strand losses, with a VT-based single-deletion construction and a VT/Hamming-inspired combined-error scheme. The results yield tight partitions and optimal codes in several regimes (notably odd $M$ for deletions) and demonstrate how to compose error-correcting capabilities across multiple error types. These findings provide practical ECC designs for composite-DNA storage systems and contribute new insights into error-control under nonstandard alphabet expansions. The methods have potential impact on reliable, high-capacity DNA data storage by enabling robust correction of a broader class of physical errors in synthesis and sequencing.
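As a small illustration of the two metrics named above, the following Python sketch computes $L_1$ and $L_\infty$ distances between vectors with entries in $[0, M]$, and the minimum distance of a toy code. The function names, the value $M = 4$, and the three example codewords are assumptions made for illustration. The paper's precise channel model and the minimum distance needed to correct $t$ errors are not reproduced here.

```python
from itertools import combinations

def l1_distance(x, y):
    """Sum of coordinate-wise absolute differences (L1 metric)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def linf_distance(x, y):
    """Largest coordinate-wise absolute difference (L-infinity metric)."""
    return max(abs(a - b) for a, b in zip(x, y))

def min_distance(code, dist):
    """Smallest pairwise distance among distinct codewords."""
    return min(dist(x, y) for x, y in combinations(code, 2))

# Hypothetical toy code over [0, M]^3 with M = 4 (illustrative values only).
M = 4
code = [(0, 0, 0), (4, 4, 4), (0, 4, 0)]

print(min_distance(code, l1_distance))    # minimum L1 distance of the toy code
print(min_distance(code, linf_distance))  # minimum L-infinity distance
```

Separating the metric from the minimum-distance routine lets the same toy code be inspected under either error model.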

Abstract

Composite DNA is a recent method to increase the base alphabet size in DNA-based data storage. This paper models synthesizing and sequencing of composite DNA and introduces coding techniques to correct substitutions, losses of entire strands, and symbol deletion errors. Non-asymptotic upper bounds on the size of codes with $t$ occurrences of these error types are derived. Explicit constructions are presented which can achieve the bounds.
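For readers unfamiliar with the Varshamov-Tenengolts (VT) codes that the single-deletion construction builds on, the sketch below shows the classical binary VT code and Levenshtein's single-deletion decoder. It is the textbook binary version, not the paper's composite-alphabet construction. The function names and the parameters `n = 6`, `a = 0` are illustrative assumptions.

```python
from itertools import product

def vt_syndrome(x):
    """Weighted checksum sum(i * x_i) mod (n + 1), positions counted from 1."""
    return sum(i * b for i, b in enumerate(x, start=1)) % (len(x) + 1)

def vt_code(n, a=0):
    """All binary words of length n whose VT syndrome equals a."""
    return [x for x in product((0, 1), repeat=n) if vt_syndrome(x) == a]

def vt_decode(y, n, a=0):
    """Recover the VT codeword of length n from y, which has one bit deleted."""
    y = list(y)
    assert len(y) == n - 1
    w = sum(y)                                        # weight of received word
    s = sum(i * b for i, b in enumerate(y, start=1))  # received checksum
    d = (a - s) % (n + 1)                             # checksum deficiency
    if d <= w:
        # A 0 was deleted: reinsert it with exactly d ones to its right.
        pos, ones = len(y), 0
        while ones < d:
            pos -= 1
            ones += y[pos]
        return tuple(y[:pos] + [0] + y[pos:])
    # A 1 was deleted: reinsert it with exactly d - w - 1 zeros to its left.
    zeros_needed, pos, zeros = d - w - 1, 0, 0
    while zeros < zeros_needed:
        zeros += 1 - y[pos]
        pos += 1
    return tuple(y[:pos] + [1] + y[pos:])

# Exhaustive check on a small example: every single deletion is corrected.
n, a = 6, 0
for x in vt_code(n, a):
    for i in range(n):
        assert vt_decode(x[:i] + x[i + 1:], n, a) == x
```

The decoder only needs the weight and checksum of the received word, which is why a single congruence suffices to correct one deletion in the binary case.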

Paper Structure (16 sections, 8 theorems, 29 equations)

Introduction
Definitions and Problem Statement
Substitution Errors
Loss of Strands
Bounds on the Size of Codes for Correcting Loss of Strands
Code Construction for Strand Loss Errors
Deletion Errors
Size of Error Balls for Single Deletion Errors
Upper Bound on the Size of Deletion-Correcting Codes
Construction of a Single-Deletion-Correcting Code
Combination of Error Types
Conclusion
Further definitions
Proof of \ref{cla:l1-equiv}
Proof of \ref{cla:linfty}
...and 1 more section

Key Result

Proposition 1

Let $A_n^\mathbb{E}\left(M, t \right)$ be the maximum cardinality of a code able to correct $t$ errors of type $\mathbb{E}$ in $[0,M]^n$. Furthermore, let $\mathcal{P}_1 ,\dots , \mathcal{P}_r$ for a positive integer $r \in \mathbb{N}$ be an exhaustive partition of $[0,M]^n$ such that $\bigcup_{i \i

Theorems & Definitions (32)

Definition 1
Definition 2
Definition 3
Definition 4
Example 1
Definition 5
Claim 1
Definition 6
Claim 2
Proposition 1
...and 22 more
