Coding for Ordered Composite DNA Sequences

Besart Dollma, Ohad Elishco, Eitan Yaakobi

TL;DR

This work introduces the ordered composite DNA channel, a model that decomposes a $k$-resolution composite sequence into $k$ ordered standard sequences and sends each through an independent channel. It develops two families of error-correcting codes for substitutions—$k$-resolution CECCs and $(e_0,\dots,e_{k-1})$-CECCs—and derives nontrivial upper and lower bounds on their cardinalities, including Levenshtein-type asymptotics and generalized sphere packing bounds. It extends the framework to deletions, providing deletion-specific bounds and systematic constructions using VT and Tenengolts codes. The results establish concrete coding strategies for reliable reconstruction of composite letters in DNA storage, with explicit bounds and constructions for both substitution and deletion errors, and identify open capacity questions for the ordered channel.
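The decomposition step can be illustrated for the binary alphabet with a minimal sketch. It rests on assumptions not spelled out in this summary: a composite symbol is taken to be the number of ones, $s \in \{0,\dots,k\}$, and row $i$ of the decomposition carries a 1 exactly when $i < s$, so every column is a run of ones followed by zeros. The function names `decompose` and `recompose` are hypothetical, and the paper's actual ordering convention may differ.

```python
from typing import List, Optional


def decompose(composite: List[int], k: int) -> List[List[int]]:
    """Split a binary k-resolution composite sequence into k ordered binary rows.

    Assumption (not taken from the paper): composite[j] in {0, ..., k} counts
    the ones at position j, and row i holds a 1 at position j iff i < composite[j].
    """
    if any(not 0 <= s <= k for s in composite):
        raise ValueError("each composite symbol must lie in 0..k")
    return [[1 if i < s else 0 for s in composite] for i in range(k)]


def recompose(rows: List[List[int]]) -> List[Optional[int]]:
    """Rebuild the composite sequence from the k received rows.

    A column that is not of the form 1...10...0 cannot come from the
    decomposition above; it is reported as None (an invalid symbol).
    """
    k = len(rows)
    out: List[Optional[int]] = []
    for column in zip(*rows):
        ones = sum(column)
        ordered = list(column) == [1] * ones + [0] * (k - ones)
        out.append(ones if ordered else None)
    return out


rows = decompose([0, 2, 1, 2], k=2)
print(rows)             # the two ordered binary sequences
print(recompose(rows))  # the original composite sequence

rows[0][1] ^= 1         # a substitution in the first channel only
print(recompose(rows))  # position 1 no longer forms an ordered column
```

Under this convention, a substitution in a single row either moves a symbol to a neighbouring value or produces a column that is not ordered, which is one way to read the "invalid symbol" mentioned in the figure captions.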

Abstract

To increase the information capacity of DNA storage, composite DNA letters were introduced. We propose a novel channel model for composite DNA in which composite sequences are decomposed into ordered standard non-composite sequences. The model is designed to handle any alphabet size and composite resolution parameter. We study the problem of reconstructing composite sequences of arbitrary resolution over the binary alphabet under substitution errors. We define two families of error-correcting codes and provide lower and upper bounds on their cardinality. In addition, we analyze the case in which a single deletion error occurs in the channel and present a systematic code construction for this setting. Finally, we briefly discuss the channel's capacity, which remains an open problem.
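As background for the single-deletion setting, the classical binary Varshamov-Tenengolts (VT) code can be sketched as follows. This is the standard textbook code with Levenshtein's decoder, not the paper's composite construction; the function names and the choice of syndrome $a = 0$ are illustrative assumptions.

```python
from typing import List


def vt_syndrome(x: List[int]) -> int:
    """VT syndrome: sum of i * x_i (1-indexed) modulo len(x) + 1."""
    return sum(i * bit for i, bit in enumerate(x, start=1)) % (len(x) + 1)


def vt_decode(y: List[int], n: int, a: int = 0) -> List[int]:
    """Recover the length-n word with syndrome a from y, which lost one bit."""
    if len(y) != n - 1:
        raise ValueError("expected exactly one deletion")
    weight = sum(y)
    # Amount by which the checksum of y falls short of the target syndrome.
    deficit = (a - sum(i * bit for i, bit in enumerate(y, start=1))) % (n + 1)
    if deficit <= weight:
        # A 0 was deleted: reinsert it with exactly `deficit` ones to its right.
        idx, ones = len(y), 0
        while ones < deficit:
            idx -= 1
            ones += y[idx]
        return y[:idx] + [0] + y[idx:]
    # A 1 was deleted: reinsert it with `deficit - weight - 1` zeros to its left.
    idx, zeros = 0, 0
    while zeros < deficit - weight - 1:
        zeros += 1 - y[idx]
        idx += 1
    return y[:idx] + [1] + y[idx:]


codeword = [1, 1, 0, 1, 0, 0]          # 1 + 2 + 4 = 7, so the syndrome is 0 mod 7
received = codeword[:2] + codeword[3:]  # delete the bit at index 2
print(vt_decode(received, n=6) == codeword)
```

The decoder only needs the weight of the received word and the checksum deficit, which is why a single checksum suffices to correct one deletion in the binary case.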

Paper Structure

This paper contains 19 sections, 60 theorems, 210 equations, 7 figures, 7 tables.

Key Result

Proposition 1

A $(k+1)$-ary $e$-error-correcting code is also a $k$-resolution $e$-CECC, i.e., $\mathcal{A}_{k+1}(n; e) \leq \mathcal{S}_k(n; e)$.

Figures (7)

  • Figure 1: Ordered composite DNA channel for resolution $k=2$.
  • Figure 2: Transformations resulting from channel errors in $2$-resolution $e$-CECCs. Dashed edges indicate transformations requiring both channels to err at the same position.
  • Figure 3: Transformations resulting from channel errors in $(1, 0, \ldots, 0)$-CECCs. Dashed arrows represent transformations to the invalid symbol.
  • Figure 4: Transformations resulting from channel errors in $k$-resolution $1$-CECCs. Transformations to the invalid symbol are omitted.
  • Figure 5: Systematic encoder $\mathrm{ENC}$ of the Tenengolts $q$-ary single-deletion-correcting code. The message $\bm{s} \in \Sigma_q^m$ is encoded into a codeword $\bm{c} \in \Sigma_q^n$. Here $t = \lceil \log_q m \rceil$. The marker ${\color{blue}pp}$, where $p \equiv (s_m + 1) \pmod{q}$, serves as a separator between the data part and the redundancy part.
  • ...and 2 more figures

Theorems & Definitions (66)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Theorem 1
  • Theorem 2
  • ...and 56 more