Representing Information on DNA using Patterns Induced by Enzymatic Labeling
Daniella Bar-Lev, Tuvi Etzion, Eitan Yaakobi, Zohar Yakhini
TL;DR
This work proposes a formal information-theoretic framework for encoding data into DNA by labeling a known template molecule, with information represented by the patterns that a set of designed labels induces on it. The labeling channel is defined over the DNA alphabet $\Sigma=\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$ and a reference sequence $S\in\Sigma^n$. The paper analyzes both fixed-length and variable-length labeling, defines $S$-uniquely-decodable labeling codes, and formulates the problem of maximizing the code size $M(n,\mathcal V)$ subject to constraints on the labels that can actually be executed. It gives an upper bound in terms of the period $\pi(S)$, namely $M(S) \le 2^{2\pi(S)-2} + 2^{\pi(S)} - 1$, together with a complete result for $\pi(S)=2$. For fixed-length labels it shows $M_\ell(n) \le \eta(n,\ell)$, where $\eta(n,\ell)$ counts the binary sequences whose runs of ones have length at least $\ell$, and gives a construction that achieves this bound when $S$ is $\ell$-repeat-free. For $\ell = c\log_4(n)$, the optimal code size scales as $M_\ell(n) = 2^{\Theta\left(\frac{\log\log(n)}{\log(n)}\cdot n\right)}$, which is subexponential in $n$. The work connects labeling design to run-length-limited constraints and de Bruijn sequences and outlines an efficient encoder-decoder pair that achieves the maximal code size under the stated conditions. It lays groundwork for DNA-based data storage with enzymatic labeling and leaves noise and synchronization in practical systems to future work.
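To make the quantity $\eta(n,\ell)$ concrete, the following is a minimal sketch that counts binary sequences of length $n$ in which every maximal run of ones has length at least $\ell$. The function names `eta` and `eta_brute_force` and the recurrence are illustrative assumptions, not taken from the paper; the brute-force version is included only as a cross-check for small $n$.

```python
from itertools import product


def eta(n: int, ell: int) -> int:
    """Count binary strings of length n whose maximal runs of ones all have length >= ell."""
    if n < 0 or ell < 1:
        raise ValueError("need n >= 0 and ell >= 1")
    # total[i]: valid strings of length i.
    # ends_zero[i]: valid strings of length i that are empty or end in 0.
    total = [0] * (n + 1)
    ends_zero = [0] * (n + 1)
    total[0] = ends_zero[0] = 1
    for i in range(1, n + 1):
        # Appending a 0 to any valid string of length i - 1 keeps it valid.
        ends_zero[i] = total[i - 1]
        # Otherwise the string ends in a run of r >= ell ones, preceded by
        # nothing or by a valid string that ends in 0.
        ends_one = sum(ends_zero[i - r] for r in range(ell, i + 1))
        total[i] = ends_zero[i] + ends_one
    return total[n]


def eta_brute_force(n: int, ell: int) -> int:
    """Enumerate all 2^n strings and test the run-length condition directly."""
    count = 0
    for bits in product("01", repeat=n):
        runs = [run for run in "".join(bits).split("0") if run]
        if all(len(run) >= ell for run in runs):
            count += 1
    return count
```

For example, `eta(3, 2)` counts the four sequences `000`, `110`, `011`, and `111`. With `ell = 1` the constraint is vacuous and the count is `2 ** n`.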
Abstract
Enzymatic DNA labeling is a powerful tool with applications in biochemistry, molecular biology, biotechnology, medical science, and genomic research. This paper contributes to the evolving field of DNA-based data storage by presenting a formal framework for modeling DNA labeling in strings, specifically tailored for data storage purposes. Our approach involves a known DNA molecule as a template for labeling, employing patterns induced by a set of designed labels to represent information. One hypothetical implementation can use CRISPR-Cas9 and gRNA reagents for labeling. Various aspects of the general labeling channel, including fixed-length labels, are explored, and upper bounds on the maximal size of the corresponding codes are given. The study includes the development of an efficient encoder-decoder pair that is proven optimal in terms of maximum code size under specific conditions.
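As an illustration of how a set of labels can induce a pattern on a known template, the sketch below marks every position of a template string that is covered by an occurrence of at least one label. This is a simplified model assumed here for illustration: the abstract does not spell out the channel definition, and the function name `label_pattern` and the example strings are hypothetical.

```python
def label_pattern(template: str, labels: list[str]) -> str:
    """Return a binary string with 1 at every template position covered by a label occurrence."""
    marks = [0] * len(template)
    for label in labels:
        if not label:
            continue  # an empty label covers nothing
        start = template.find(label)
        while start != -1:
            # Mark all positions spanned by this occurrence.
            for k in range(start, start + len(label)):
                marks[k] = 1
            # Continue from the next position so overlapping occurrences are found.
            start = template.find(label, start + 1)
    return "".join(str(bit) for bit in marks)


if __name__ == "__main__":
    template = "ACGTACGA"
    print(label_pattern(template, ["ACG"]))        # 11101110
    print(label_pattern(template, ["GT"]))         # 00110000
    print(label_pattern(template, ["ACG", "GT"]))  # 11111110
```

Under this assumed model, choosing a different subset of labels yields a different binary pattern on the same template, and each run of ones is at least as long as the shortest label used.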
