Representing Information on DNA using Patterns Induced by Enzymatic Labeling

Daniella Bar-Lev, Tuvi Etzion, Eitan Yaakobi, Zohar Yakhini

TL;DR

This work proposes a formal information-theoretic framework for encoding data in DNA by labeling a known template, with information represented by the patterns that a set of designed labels induces on it. It introduces a labeling channel over the fixed DNA alphabet $\Sigma=\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$ with a reference sequence $S\in\Sigma^n$. It analyzes both fixed-length and variable-length labeling, defines $S$-uniquely-decodable labeling codes, and formulates optimization problems to maximize the code size $M(n,\mathcal V)$ under constraints with executable labels. The paper provides an upper bound in terms of $\pi(S)$, namely $M(S) \le 2^{2\pi(S)-2} + 2^{\pi(S)} - 1$, and a complete result for $\pi(S)=2$. For labels of fixed length $\ell$, it shows $M_\ell(n) \le \eta(n,\ell)$, where $\eta(n,\ell)$ counts the length-$n$ binary sequences whose runs of ones have length at least $\ell$, and gives a construction that achieves the bound when $S$ is $\ell$-repeat-free. For $\ell = c\log_4(n)$, the optimal code size scales as $M_\ell(n) = 2^{\Theta\left(\frac{\log\log(n)}{\log(n)}\cdot n\right)}$, which is subexponential in $n$. The work connects labeling design to run-length limited constraints and de Bruijn sequences and outlines an efficient encoder-decoder pair achieving maximal code size under stated conditions. It lays groundwork for DNA-based data storage with enzymatic labeling and leaves noise and synchronization in practical systems to future work.
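To make the counting function $\eta(n,\ell)$ concrete, the following is a minimal Python sketch that counts length-$n$ binary sequences in which every maximal run of ones has length at least $\ell$. The function names `eta` and `eta_bruteforce` are hypothetical, and the recurrence is one straightforward way to count such sequences, not necessarily the formula used in the paper.

```python
from itertools import product


def eta_bruteforce(n, ell):
    """Count valid sequences by enumerating all 2^n binary strings (small n only)."""

    def runs_ok(bits):
        run = 0
        for b in bits:
            if b == 1:
                run += 1
            else:
                if 0 < run < ell:  # a run of ones ended too early
                    return False
                run = 0
        return not (0 < run < ell)  # check the final run as well

    return sum(1 for bits in product((0, 1), repeat=n) if runs_ok(bits))


def eta(n, ell):
    """Count the same sequences with a dynamic program over the prefix length."""
    count = [0] * (n + 1)
    count[0] = 1  # the empty sequence is valid
    for i in range(1, n + 1):
        total = count[i - 1]  # the sequence ends with a 0
        # Otherwise it ends with a run of k >= ell ones, preceded either by
        # nothing (k == i) or by a 0 that follows a valid prefix.
        for k in range(ell, i + 1):
            j = i - k  # symbols before the final run
            total += 1 if j == 0 else count[j - 1]
        count[i] = total
    return count[n]


print(eta(4, 2))  # 0000, 1100, 0110, 0011, 1110, 0111, 1111 -> 7
```

With $\ell = 1$ every binary sequence is valid, so `eta(n, 1)` equals $2^n$; larger $\ell$ removes sequences with short runs, which is the mechanism behind the subexponential growth stated above.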

Abstract

Enzymatic DNA labeling is a powerful tool with applications in biochemistry, molecular biology, biotechnology, medical science, and genomic research. This paper contributes to the evolving field of DNA-based data storage by presenting a formal framework for modeling DNA labeling in strings, specifically tailored for data storage purposes. Our approach involves a known DNA molecule as a template for labeling, employing patterns induced by a set of designed labels to represent information. One hypothetical implementation can use CRISPR-Cas9 and gRNA reagents for labeling. Various aspects of the general labeling channel, including fixed-length labels, are explored, and upper bounds on the maximal size of the corresponding codes are given. The study includes the development of an efficient encoder-decoder pair that is proven optimal in terms of maximum code size under specific conditions.

Paper Structure

This paper contains 7 sections, 15 theorems, and 21 equations.

Key Result

Lemma 1

If $S$ is a sequence with a single run of the symbol $\sigma$, then every $S$-uniquely-decodable labeling code ${\cal C}$ satisfies $|{\cal C}| \le 2$, and hence $M(S)=2$. Furthermore, the code ${\cal C}_\sigma=\{\varnothing, \{\sigma\}\}$ is $S$-uniquely-decodable.
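The intuition behind Lemma 1 can be checked numerically under an assumed channel model. The sketch below assumes, since this summary does not spell out the channel, that a label marks every position covered by one of its occurrences in $S$, and that a set of labels produces the position-wise OR of its labels' markings. The helpers `label_output` and `reachable_outputs` are hypothetical names introduced only for this illustration.

```python
from itertools import product

SIGMA = "ACGT"


def label_output(s, label):
    """Binary marking of s by one label (assumed model: covered positions become 1)."""
    out = [0] * len(s)
    k = len(label)
    for i in range(len(s) - k + 1):
        if s[i:i + k] == label:
            for j in range(i, i + k):
                out[j] = 1
    return tuple(out)


def reachable_outputs(s, max_len):
    """All outputs obtainable from sets of labels of length at most max_len."""
    singles = {
        label_output(s, "".join(p))
        for k in range(1, max_len + 1)
        for p in product(SIGMA, repeat=k)
    }
    # The empty label set yields the all-zero output; larger sets are ORs of singles.
    outputs = {(0,) * len(s)}
    frontier = list(outputs)
    while frontier:
        cur = frontier.pop()
        for single in singles:
            merged = tuple(a | b for a, b in zip(cur, single))
            if merged not in outputs:
                outputs.add(merged)
                frontier.append(merged)
    return outputs


print(len(reachable_outputs("AAAAA", 5)))  # 2
```

Under this assumed model, a single-run sequence such as `AAAAA` admits only the all-zero and all-one outputs, so at most two label sets can be told apart, which matches $M(S)=2$ and the code ${\cal C}_\sigma=\{\varnothing,\{\sigma\}\}$.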

Theorems & Definitions (27)

  • Definition 1
  • Example 1
  • Definition 2
  • Example 2
  • Definition 3
  • Definition 4
  • Example 3
  • Lemma 1
  • Definition 5
  • Lemma 2
  • ...and 17 more