Representing Information on DNA using Patterns Induced by Enzymatic Labeling

Daniella Bar-Lev; Tuvi Etzion; Eitan Yaakobi; Zohar Yakhini

TL;DR

This work proposes a formal information-theoretic framework for encoding data into DNA by labeling a known template with patterns induced by a set of designed labels. It introduces a labeling channel over the fixed DNA alphabet $\Sigma=\{\mathsf{A},\mathsf{C},\mathsf{G},\mathsf{T}\}$ with a reference sequence $S\in\Sigma^n$, analyzes both fixed-length and variable-length labeling, defines $S$-uniquely-decodable labeling codes, and formulates optimization problems that maximize the code size $M(n,\mathcal V)$ under constraints on the executable labels. The paper gives a period-based upper bound $M(S) \le 2^{2\pi(S)-2} + 2^{\pi(S)} - 1$ and a complete result for $\pi(S)=2$. For fixed-length labels it shows that $M_\ell(n) \le \eta(n,\ell)$, where $\eta(n,\ell)$ counts binary sequences whose runs of ones have length at least $\ell$, and it gives a construction that achieves this bound when $S$ is $\ell$-repeat-free. For $\ell = c\log_4(n)$, the optimal code size scales as $M_\ell(n) = 2^{\Theta\left(\frac{\log\log(n)}{\log(n)}\cdot n\right)}$, which is subexponential in $n$. The work connects labeling design to run-length limited constraints and de Bruijn sequences, and outlines an efficient encoder-decoder pair that achieves the maximal code size under stated conditions. It lays groundwork for DNA-based data storage with enzymatic labeling and leaves noise and synchronization in practical systems to future work.
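The quantity $\eta(n,\ell)$ can be made concrete with a small counting routine. The sketch below is illustrative only: it assumes that $\eta(n,\ell)$ counts binary sequences of length $n$ in which every maximal run of ones has length at least $\ell$ (the summary does not spell out the length or the treatment of boundary runs), and the function names `eta` and `eta_brute` are hypothetical, not taken from the paper.

```python
from itertools import product


def eta(n: int, ell: int) -> int:
    """Count binary strings of length n whose maximal runs of ones all have
    length >= ell (assumed reading of eta(n, ell)), by dynamic programming."""
    if n < 0 or ell < 1:
        raise ValueError("need n >= 0 and ell >= 1")
    # f[m] = number of valid strings of length m; the empty string is valid.
    f = [0] * (n + 1)
    f[0] = 1
    for m in range(1, n + 1):
        # Case 1: the string ends in 0, so its first m-1 symbols form a valid string.
        total = f[m - 1]
        # Case 2: the string ends in a maximal run of exactly r ones, r >= ell.
        for r in range(ell, m + 1):
            if r == m:
                total += 1  # the all-ones string
            else:
                # The run is preceded by a 0, and before that by a valid prefix.
                total += f[m - r - 1]
        f[m] = total
    return f[n]


def eta_brute(n: int, ell: int) -> int:
    """Exhaustive cross-check for small n."""
    count = 0
    for bits in product("01", repeat=n):
        runs = "".join(bits).split("0")  # maximal runs of ones (possibly empty)
        if all(len(run) == 0 or len(run) >= ell for run in runs):
            count += 1
    return count
```

Under this reading, `eta(4, 2)` counts the seven strings 0000, 1100, 0110, 0011, 1110, 0111 and 1111, and `eta(n, 1)` equals $2^n$ because every binary string qualifies.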

Abstract

Enzymatic DNA labeling is a powerful tool with applications in biochemistry, molecular biology, biotechnology, medical science, and genomic research. This paper contributes to the evolving field of DNA-based data storage by presenting a formal framework for modeling DNA labeling in strings, specifically tailored for data storage purposes. Our approach involves a known DNA molecule as a template for labeling, employing patterns induced by a set of designed labels to represent information. One hypothetical implementation can use CRISPR-Cas9 and gRNA reagents for labeling. Various aspects of the general labeling channel, including fixed-length labels, are explored, and upper bounds on the maximal size of the corresponding codes are given. The study includes the development of an efficient encoder-decoder pair that is proven optimal in terms of maximum code size under specific conditions.
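To make the idea of "patterns induced by a set of designed labels" concrete, here is a minimal sketch of one possible labeling model. It assumes, as a simplification not stated in the abstract, that a label is a short string over the DNA alphabet, that it marks every template position covered by one of its occurrences, and that the observed pattern is the resulting binary sequence. The function name `label_pattern` and the example template are hypothetical.

```python
from typing import Iterable, List


def label_pattern(template: str, labels: Iterable[str]) -> List[int]:
    """Return a 0/1 list marking the template positions covered by any
    occurrence of any label (assumed noiseless labeling model)."""
    marked = [0] * len(template)
    for label in labels:
        if not label:
            continue  # an empty label marks nothing
        start = template.find(label)
        while start != -1:
            for i in range(start, start + len(label)):
                marked[i] = 1
            # Advance by one position so overlapping occurrences are also found.
            start = template.find(label, start + 1)
    return marked


template = "ACGTACGA"                                  # hypothetical reference sequence
pattern_ac = label_pattern(template, {"AC"})           # occurrences at positions 0 and 4
pattern_two = label_pattern(template, {"CG", "GA"})    # two labels used together
```

With this template, the single label `AC` yields `[1, 1, 0, 0, 1, 1, 0, 0]`, while the pair `CG`, `GA` yields `[0, 1, 1, 0, 0, 1, 1, 1]`. Different label sets produce different patterns on the same template, which is what allows a label set to act as a codeword.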

Paper Structure (7 sections, 15 theorems, 21 equations)

Introduction
Definitions, Problem Statement, and a First Bound
Definitions
Problem Statement
Basic Results using Periodicity
Fixed-Length Labels
Conclusions

Key Result

Lemma 1

If $S$ is a sequence with a single run of the symbol $\sigma$, then every $S$-uniquely-decodable labeling code ${\cal C}$ has at most two codewords, and hence $M(S)=2$. Furthermore, the code ${\cal C}_\sigma=\{\varnothing, \{\sigma\}\}$ is $S$-uniquely-decodable.
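The lemma can be checked by brute force on a small instance. The sketch below assumes a simplified model that is not spelled out in this summary: a label marks every position of $S$ covered by one of its occurrences, and a code is $S$-uniquely-decodable when distinct label sets yield distinct patterns on $S$. The names `label_pattern` and `distinct_patterns` and the candidate label list are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple


def label_pattern(template: str, labels: Iterable[str]) -> Tuple[int, ...]:
    """0/1 tuple marking positions covered by any label occurrence (assumed model)."""
    marked = [0] * len(template)
    for label in labels:
        if not label:
            continue
        start = template.find(label)
        while start != -1:
            for i in range(start, start + len(label)):
                marked[i] = 1
            start = template.find(label, start + 1)  # allow overlapping occurrences
    return tuple(marked)


def distinct_patterns(template: str, candidates: list) -> Set[Tuple[int, ...]]:
    """Collect the patterns produced by every subset of the candidate labels."""
    patterns = set()
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            patterns.add(label_pattern(template, subset))
    return patterns


S = "AAAAA"  # a single run of the symbol A
# Candidate labels: runs of A up to one symbol longer than S, plus a label that never occurs.
candidates = ["A" * k for k in range(1, len(S) + 2)] + ["C"]
patterns = distinct_patterns(S, candidates)
```

Here `patterns` contains only the all-zeros and all-ones tuples: any label that occurs in `S` covers every position, and any label that does not occur marks nothing. The empty label set and the set `{"A"}` realize these two patterns, matching the code ${\cal C}_\sigma$ in the lemma.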

Theorems & Definitions (27)

Definition 1
Example 1
Definition 2
Example 2
Definition 3
Definition 4
Example 3
Lemma 1
Definition 5
Lemma 2
...and 17 more
