
Constrained Coding for Composite DNA: Channel Capacity and Efficient Constructions

Tuan Thanh Nguyen, Chen Wang, Kui Cai, Yiwei Zhang, Zohar Yakhini

TL;DR

This work tackles the challenge of increasing DNA data storage capacity beyond the standard four-letter alphabet by using composite DNA letters formed as nucleotide mixtures. It develops constrained coding frameworks to enforce runlength-limited (RLL) and GC-content constraints across all composite letters, and it derives capacity formulas and efficient linear-time encoders/decoders. For several parameter regimes, the authors show capacity gains over the conventional limit of 2 bits/symbol, and they design encoding schemes that achieve capacity with minimal redundancy, including cases with a single redundant symbol. They also demonstrate that, under certain conditions, RLL and GC-content constraints can be combined without any loss in capacity, providing practical encoding strategies for capacity-approaching composite DNA codes with concrete synthesis-efficiency benefits.

Abstract

Composite DNA is a recent method to increase the information capacity of DNA-based data storage above the theoretical limit of 2 bits/symbol. In this method, each composite symbol stores not a single DNA nucleotide but a mixture of the four nucleotides in a predetermined ratio. By using different mixtures and ratios, the alphabet can be extended to have many more symbols than the four of the naive approach. While this method enables higher data content per synthesis cycle, potentially reducing the DNA synthesis cost, it also poses significant challenges for accurate DNA sequencing, since base-level errors can easily change the mixture of bases and their ratio, resulting in changes to the composite symbols. With this motivation, we propose efficient constrained coding techniques to enforce the biological constraints, including the runlength-limited constraint and the GC-content constraint, in every synthesized DNA oligo, regardless of the mixture of bases in each composite letter and their corresponding ratio. Our goals include computing the capacity of the constrained channel, constructing efficient encoders/decoders, and providing the best options for the composite letters to obtain capacity-approaching codes. For certain code parameters, our methods incur only one redundant symbol.
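
To make the phrase "regardless of the mixture of bases in each composite letter" concrete, the following Python sketch checks whether every DNA sequence that could be synthesized from a composite sequence satisfies both constraints. It is an illustration only, not the paper's encoder. It rests on assumptions that are not in the text above: a composite letter is modeled only by the set of bases it can produce (the ratios are ignored), the runlength-limited constraint is taken to mean that no run of identical bases is longer than `ell`, and the GC-content constraint is given directly as a range of allowed G/C counts. All function names are hypothetical.

```python
from itertools import groupby, product


def max_run(seq):
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def gc_count(seq):
    """Number of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq)


def satisfies(seq, ell, gc_min, gc_max):
    """True if seq has no run longer than ell and a G/C count in [gc_min, gc_max]."""
    return max_run(seq) <= ell and gc_min <= gc_count(seq) <= gc_max


def realizations(composite):
    """Yield every DNA sequence obtainable from a composite sequence.

    Assumption: each composite letter is written as a string of the bases
    it may produce, e.g. "CG" for a mixture of C and G.
    """
    for bases in product(*composite):
        yield "".join(bases)


def composite_ok(composite, ell, gc_min, gc_max):
    """True only if all possible realizations satisfy both constraints."""
    return all(satisfies(s, ell, gc_min, gc_max) for s in realizations(composite))


# Hypothetical composite sequences of length 8.
good = ["A", "CG", "T", "AT", "G", "C", "T", "G"]
bad = ["A", "AT", "A", "A", "G", "C", "G", "C"]  # can realize "AAAAGCGC"

print(composite_ok(good, ell=3, gc_min=3, gc_max=5))
print(composite_ok(bad, ell=3, gc_min=3, gc_max=5))
```

The exhaustive check grows exponentially with the number of mixed positions, so it is only useful for small examples. The paper's stated aim is linear-time encoders/decoders that guarantee the constraints by construction.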

Paper Structure

This paper contains 11 sections, 9 theorems, 18 equations, 2 figures, and 3 tables.

Key Result

Theorem 1

Our encoder $\textsc{Enc}_{\ell; \Sigma_k}$ is well-defined. In other words, the replacement procedure is guaranteed to terminate.

Figures (2)

  • Figure 1: Possible synthesized DNA sequences from two data sequences of length $8$ over a composite DNA alphabet. Here $\ell=3$ and $\epsilon=0.1$, i.e., the total number of ${\tt G}$ and ${\tt C}$ is within $[3,5]$.
  • Figure 2

Theorems & Definitions (23)

  • Example 1
  • Definition 1
  • Example 2
  • Example 3
  • Example 4: Continuing from Example \ref{sigma1}
  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Remark 1
  • ...and 13 more