Coding for Composite DNA to Correct Substitutions, Strand Losses, and Deletions

Frederik Walter, Omer Sabary, Antonia Wachter-Zeh, Eitan Yaakobi

TL;DR

The paper develops a binary-model framework for composite DNA data storage, where composite symbols arise from mixtures of bases to enlarge the alphabet. It establishes non-asymptotic bounds and explicit constructions for correcting substitutions, strand losses, and deletions, by proving equivalences to classical metric spaces: $L_1$ for substitutions and $L_\infty$ for strand losses, with a VT-based single-deletion construction and a VT/Hamming-inspired combined-error scheme. The results yield tight partitions and optimal codes in several regimes (notably odd $M$ for deletions) and demonstrate how to compose error-correcting capabilities across multiple error types. These findings provide practical ECC designs for composite-DNA storage systems and contribute new insights into error-control under nonstandard alphabet expansions. The methods have potential impact on reliable, high-capacity DNA data storage by enabling robust correction of a broader class of physical errors in synthesis and sequencing.
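To make the two metric spaces named above concrete, the following is a minimal sketch of the $L_1$ and $L_\infty$ distances between two words. It assumes words are integer vectors in $[0,M]^n$ (the notation used in Proposition 1); the function names and the sample vectors are hypothetical and not taken from the paper.

```python
def l1_distance(x, y):
    """L1 (Manhattan) distance: sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def linf_distance(x, y):
    """L-infinity (Chebyshev) distance: largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Illustrative words with M = 4 and n = 3 (assumed values).
x = (0, 2, 4)
y = (1, 2, 1)
d1 = l1_distance(x, y)      # |0-1| + |2-2| + |4-1|
dinf = linf_distance(x, y)  # max(1, 0, 3)
```

Since the largest coordinate difference never exceeds the sum of all differences, the $L_\infty$ distance is at most the $L_1$ distance for any pair of words.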

Abstract

Composite DNA is a recent method to increase the base alphabet size in DNA-based data storage. This paper models the synthesis and sequencing of composite DNA and introduces coding techniques to correct substitutions, losses of entire strands, and symbol deletions. Non-asymptotic upper bounds on the size of codes correcting $t$ occurrences of these error types are derived. Explicit constructions that can achieve the bounds are presented.
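As background for the VT-based deletion construction mentioned in the TL;DR, here is a minimal sketch of the classical binary Varshamov-Tenengolts (VT) code and its standard single-deletion decoder. This is the textbook binary code, not the paper's composite-DNA construction; the function names, the block length `n = 6`, and the syndrome `a = 0` are assumptions chosen for illustration.

```python
from itertools import product


def vt_syndrome(x):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions i = 1..n."""
    return sum(i * b for i, b in enumerate(x, start=1)) % (len(x) + 1)


def vt_code(n, a=0):
    """All binary words of length n whose VT syndrome equals a."""
    return [x for x in product((0, 1), repeat=n) if vt_syndrome(x) == a]


def vt_decode(y, n, a=0):
    """Recover the length-n codeword from y, a codeword with one bit deleted."""
    assert len(y) == n - 1
    y = list(y)
    w = sum(y)  # number of ones in the received word
    s = sum(i * b for i, b in enumerate(y, start=1))
    d = (a - s) % (n + 1)  # checksum deficiency caused by the deletion
    if d <= w:
        # A 0 was deleted: reinsert it so that exactly d ones lie to its right.
        pos, ones = len(y), 0
        while ones < d:
            pos -= 1
            ones += y[pos]
        return tuple(y[:pos] + [0] + y[pos:])
    # A 1 was deleted: reinsert it so that exactly d - w - 1 zeros lie to its left.
    pos, zeros = 0, 0
    while zeros < d - w - 1:
        zeros += 1 - y[pos]
        pos += 1
    return tuple(y[:pos] + [1] + y[pos:])


# Exhaustive check for the assumed parameters: every single deletion is corrected.
n, a = 6, 0
code = vt_code(n, a)
all_corrected = all(
    vt_decode(x[:i] + x[i + 1:], n, a) == x for x in code for i in range(n)
)
```

The decoder relies on the fact that deleting a 0 lowers the checksum by the number of ones to its right, while deleting a 1 lowers it by its position plus the number of ones to its right, so the deficiency alone reveals both the deleted bit and a valid reinsertion point.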

Paper Structure

This paper contains 16 sections, 8 theorems, and 29 equations.

Key Result

Proposition 1

Let $A_n^\mathbb{E}\left(M, t \right)$ be the maximum cardinality of a code able to correct $t$ errors of type $\mathbb{E}$ in $[0,M]^n$. Furthermore, let $\mathcal{P}_1 ,\dots , \mathcal{P}_r$ for a positive integer $r \in \mathbb{N}$ be an exhaustive partition of $[0,M]^n$ such that $\bigcup_{i \i

Theorems & Definitions (32)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Example 1
  • Definition 5
  • Claim 1
  • Definition 6
  • Claim 2
  • Proposition 1
  • ...and 22 more