On the Capacity of DNA Labeling

Dganit Hanania, Daniella Bar-Lev, Yevgeni Nogin, Yoav Shechtman, Eitan Yaakobi

TL;DR

The paper models DNA labeling as a channel in which a DNA sequence is marked by $k$ short patterns (labels), producing a $(k+1)$-ary output sequence. It defines the labeling capacity as the asymptotic rate ${\mathsf{cap}(\underline{\boldsymbol \alpha})} = \limsup_{n\to\infty} \frac{\log_2|F_n(\underline{\boldsymbol \alpha})|}{n}$. The paper links labeling capacity to constrained system theory and derives exact capacity expressions via Perron eigenvalues for single labels in the non-cyclic, periodic, and cyclic-overlap cases, then extends the analysis to multiple labels in both non-overlapping and overlapping configurations. Key contributions include characteristic polynomials whose roots give the capacity for various label classes, an ordering of labels by capacity for lengths $\ell\leq5$, the minimal number of labels required to achieve the full capacity $\log_2 q$, and the maximal capacities achievable with two or more labels, supported by detailed case analyses. The work connects DNA labeling to run-length constrained systems and offers guidance on label design and fundamental limits for DNA-based information encoding. Each capacity value is $\log_2$ of the largest real root of a characteristic polynomial (e.g., $x^\ell-x^{\ell-1}-1$ or $x^{\ell+1}-x^\ell-x^{\ell-p+1}+x^{\ell-p}-1$), reflecting the combinatorial structure of the labeling constraints.

Abstract

DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k+1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label, and several results are provided for multiple labels as well. Moreover, we provide the minimal number of labels of length one or two that are needed in order to attain a labeling capacity of 2.
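
To make the labeling process concrete, the following is a minimal brute-force sketch for a single label. It assumes, beyond what the abstract states, that the output sequence has a 1 at every position where the label starts and a 0 elsewhere, and that $F_n$ is the set of distinct outputs over all DNA sequences of length $n$. The function names `labeling_sequence` and `count_labelings` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import log2


def labeling_sequence(x, label):
    """Binary output for DNA string x: 1 where `label` starts, else 0.

    Assumed convention: a label occurrence is marked at its first position.
    """
    ell = len(label)
    return tuple(1 if x[i:i + ell] == label else 0 for i in range(len(x)))


def count_labelings(n, label, alphabet="ACGT"):
    """Number of distinct labeling outputs over all length-n sequences."""
    seen = {
        labeling_sequence("".join(x), label)
        for x in product(alphabet, repeat=n)
    }
    return len(seen)


# Finite-length rate estimates log2|F_n| / n for the label "AC".
for n in range(2, 8):
    size = count_labelings(n, "AC")
    print(n, size, log2(size) / n)
```

The enumeration grows as $4^n$, so this is only practical for small $n$. The printed rates are finite-length estimates and approach the capacity slowly.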

Paper Structure (13 sections, 23 theorems, 18 equations, 14 figures, 2 tables)

This paper contains 13 sections, 23 theorems, 18 equations, 14 figures, 2 tables.

Key Result

Theorem 1

Let ${\boldsymbol \alpha}\in\Sigma_q^\ell$ be a non-cyclic label of length $\ell$. Then $\mathsf{cap}({\boldsymbol \alpha})=\mathsf{cap}(\mathcal{C}_{{\ell-1},\infty})$. That is, $\mathsf{cap}({\boldsymbol \alpha})=\log_2\lambda$, where $\lambda$ is the largest real root of $x^\ell-x^{\ell-1}-1$.
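
As a numerical illustration of Theorem 1, the sketch below evaluates $\log_2\lambda$ by bisection. It relies on the observation that $f(x)=x^\ell-x^{\ell-1}-1$ is increasing for $x\geq 1$, with $f(1)=-1<0$ and $f(2)=2^{\ell-1}-1\geq 0$, so the largest real root lies in $(1,2]$. The function name `noncyclic_label_capacity` is hypothetical, and the bisection approach is a choice made here rather than the paper's method.

```python
from math import log2


def noncyclic_label_capacity(ell, iterations=200):
    """log2 of the largest real root of x^ell - x^(ell-1) - 1, for ell >= 1."""
    if ell < 1:
        raise ValueError("label length must be at least 1")

    def f(x):
        return x ** ell - x ** (ell - 1) - 1

    lo, hi = 1.0, 2.0  # f(lo) < 0 <= f(hi), and f is increasing on [1, 2]
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return log2(hi)


for ell in range(1, 6):
    print(ell, noncyclic_label_capacity(ell))
```

For $\ell=2$ the polynomial is $x^2-x-1$, whose largest root is the golden ratio, so the returned value should agree with $\log_2\frac{1+\sqrt{5}}{2}$.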

Figures (14)

  • Figure 1: Graph presentation of the constraint in the periodic-label example.
  • Figure 2: Graph presentation of the constraint in the periodic-label theorem.
  • Figure 3: Graph presentation of the constraint in the cyclic-overlap example.
  • Figure 4: Graph presentation of the constraint in the cyclic-overlap theorem.
  • Figure 5: The order between different types of labels of length $\ell \leq 5$, according to their labeling capacity values.
  • ...and 9 more figures

Theorems & Definitions (41)

  • Definition 1
  • Definition 2
  • Example 1
  • Definition 3
  • Example 2
  • Definition 4
  • Definition 5
  • Theorem 1
  • Corollary 1
  • Example 3
  • ...and 31 more