Coding for Synthesis Defects
Ziyang Lu, Han Mao Kiah, Yiwei Zhang, Robert N. Grass, Eitan Yaakobi
TL;DR
The study addresses synthesis defects in parallel DNA strand synthesis for data storage by formulating two code families: KDCC for known defect cycles and SDCC for unknown defect locations. It develops a reduction from quaternary to binary codes via a signature $\tilde{x}$ and constructs explicit KDCC schemes for t=1 and t=2, achieving redundancies as low as $\log 4$ and $\log n+18\log 3$, respectively. For the unknown-defect setting, it introduces defect-locating strands to constrain error locations and provides 1-SDCC and 2-SDCC constructions with redundancy bounds scaling as $O((\log n)^2)$ and $O(M\log n)$ terms, respectively, demonstrating substantial redundancy savings over naive per-strand deletion codes. A lower bound shows that the 1-KDCC redundancy is essentially tight up to lower-order terms, underscoring the near-optimality of the proposed KDCC designs, while the SDCC framework offers a practical, scalable approach for multi-strand synthesis with defect localization. Together, the results advance efficient, low-redundancy coding for synthesis-based DNA data storage, enabling faster synthesis and reduced costs.
Abstract
Motivated by DNA-based data storage systems, we investigate the errors that occur when synthesizing DNA strands in parallel, where the machine appends one nucleotide at a time to each strand according to a template supersequence. If the machine fails at some cycle, then the strands meant to be appended at this cycle are not appended, and we refer to this as a synthesis defect. In this paper, we present two families of codes correcting synthesis defects: t-known-synthesis-defect correcting codes and t-synthesis-defect correcting codes. For the first family, the defective cycles are assumed to be known, and each codeword is a quaternary sequence. We provide constructions for this family for t = 1, 2, with redundancy log 4 and log n+18 log 3, respectively. For the second family, each codeword is a set of M ordered sequences, and we give constructions for t = 1, 2 to illustrate a strategy for constructing such codes. Finally, we derive a lower bound on the redundancy of single-known-synthesis-defect correcting codes, which shows that our construction is almost optimal.
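
The synthesis model described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the function names `synthesis_schedule` and `synthesize`, the periodic template `ACGT` repeated three times, and the example strands are all hypothetical. It also assumes that each strand is appended greedily at the earliest matching cycle and that the machine continues with its planned schedule after a failed cycle, so that a synthesis defect shows up as a deletion in every strand scheduled for that cycle.

```python
def synthesis_schedule(strand, template):
    """Return the cycle index at which each symbol of `strand` is appended.

    Assumption: greedy schedule, i.e. the strand is extended at the first
    cycle whose template nucleotide matches its next symbol.
    """
    cycles = []
    pos = 0
    for cycle_index, nucleotide in enumerate(template):
        if pos < len(strand) and strand[pos] == nucleotide:
            cycles.append(cycle_index)
            pos += 1
    if pos != len(strand):
        raise ValueError("template is not a supersequence of the strand")
    return cycles


def synthesize(strands, template, defect_cycles=()):
    """Simulate parallel synthesis of `strands` with failed cycles.

    Assumption: a symbol scheduled for a defective cycle is simply lost,
    while all other symbols are appended as planned.
    """
    defects = set(defect_cycles)
    produced = []
    for strand in strands:
        schedule = synthesis_schedule(strand, template)
        produced.append(
            "".join(s for s, c in zip(strand, schedule) if c not in defects)
        )
    return produced


template = "ACGT" * 3            # cycles 0..11
strands = ["ACA", "GGT", "CAT"]  # schedules: [0,1,4], [2,6,7], [1,4,7]

print(synthesize(strands, template))                     # ['ACA', 'GGT', 'CAT']
print(synthesize(strands, template, defect_cycles=[4]))  # ['AC', 'GGT', 'CT']
```

In this toy run, a single defective cycle deletes one symbol from two of the three strands and leaves the third untouched, which is the kind of correlated error pattern across strands that the two code families are designed to handle.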
