Coding for Composite DNA to Correct Substitutions, Strand Losses, and Deletions
Frederik Walter, Omer Sabary, Antonia Wachter-Zeh, Eitan Yaakobi
TL;DR
The paper develops a binary-model framework for composite DNA data storage, in which composite symbols are mixtures of bases that enlarge the alphabet. It derives non-asymptotic bounds and explicit constructions for correcting substitutions, strand losses, and deletions. Substitutions and strand losses are handled by proving equivalences to classical metric spaces ($L_1$ for substitutions, $L_\infty$ for strand losses), single deletions by a VT-based construction, and combined errors by a VT/Hamming-inspired scheme. The results yield tight partitions and optimal codes in several regimes (notably odd $M$ for deletions) and show how to compose error-correcting capabilities across multiple error types. These findings provide practical ECC designs for composite-DNA storage systems and contribute new insights into error control under nonstandard alphabet expansions. By enabling correction of a broader class of physical errors in synthesis and sequencing, the methods may help make high-capacity DNA data storage more reliable.
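For readers unfamiliar with the VT building block mentioned above, the following is a minimal sketch of the classical binary Varshamov-Tenengolts (VT) code, not the paper's composite-DNA construction. The function names (`vt_syndrome`, `vt_code`, `correct_single_deletion`) are illustrative assumptions, and the decoder is a brute-force search chosen for clarity rather than efficiency.

```python
from itertools import product


def vt_syndrome(x):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions i starting at 1."""
    n = len(x)
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1)


def vt_code(n, a=0):
    """All binary words of length n whose VT syndrome equals a."""
    return [x for x in product((0, 1), repeat=n) if vt_syndrome(x) == a]


def correct_single_deletion(y, n, a=0):
    """Brute-force decoder (illustrative only).

    y is a tuple of length n - 1 obtained from a codeword by one deletion.
    Every single-bit insertion into y is tried, and the words whose
    syndrome equals a are returned as a set.
    """
    candidates = set()
    for pos in range(n):          # n possible insertion positions
        for bit in (0, 1):
            x = y[:pos] + (bit,) + y[pos:]
            if vt_syndrome(x) == a:
                candidates.add(x)
    return candidates


if __name__ == "__main__":
    n = 6
    for codeword in vt_code(n):
        for i in range(n):
            received = codeword[:i] + codeword[i + 1:]  # delete position i
            assert correct_single_deletion(received, n) == {codeword}
    print("every single deletion was corrected")
```

Because the code corrects any single deletion, the candidate set returned by the search contains exactly the transmitted codeword. Linear-time decoders for VT codes exist; the exhaustive search here only serves to make the checksum condition visible.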
Abstract
Composite DNA is a recent method to increase the base alphabet size in DNA-based data storage. This paper models the synthesis and sequencing of composite DNA and introduces coding techniques to correct substitutions, losses of entire strands, and symbol deletion errors. Non-asymptotic upper bounds are derived on the size of codes that correct $t$ occurrences of these error types. Explicit constructions are presented that can achieve these bounds.
