Studying the Cycle Complexity of DNA Synthesis

Amit Zrihan, Eitan Yaakobi, Zohar Yakhini

TL;DR

The paper analyzes the cycle complexity of photolithographic DNA synthesis for data storage by formalizing capacity measures $\text{cap}(q,\rho)$ and $\text{cap}(q)$ with subsequence counts $M_q(C,L)$ and deletion spheres $D_q(C,t)$. It presents three encoder families to approach capacity across a range of synthesis density $\rho$: a memory-heavy Lookup Table, a linear-time Multi-size Alphabet Encoder, and a Knuth-style Effective Encoder optimized for $\rho=\frac{1}{2}$, with concrete rate formulas and asymptotics showing that large alphabet sizes push performance toward high density. A synthesis-cost framework is developed, yielding $\text{cost}^*(N,C,q,\rho)=\boldsymbol{\alpha} C+\boldsymbol{\beta}N\frac{\rho}{\text{cap}(q,\rho)}$ and showing that, under favorable regimes and with increasing $q$, the minimal cost approaches $\boldsymbol{\alpha} C$, while identifying the critical $\rho^*$ via $\frac{\rho}{H(\rho)}=\frac{1}{\log_2 q}$. The findings facilitate higher information densities per synthesis cycle and inform cost-aware design choices for DNA-based data storage, with practical implications for alphabet design and encoding algorithms in scalable photolithographic synthesis.
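The two closed-form relations in the summary above can be explored numerically. The following is a minimal sketch, not the authors' code. It assumes that $H$ denotes the binary entropy function in bits, and it treats $\text{cap}(q,\rho)$ as a value supplied by the caller, since its definition is not given here. The function names `binary_entropy`, `rho_star`, and `cost_star` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention.

    Assumption: the H in rho / H(rho) = 1 / log2(q) is this function.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rho_star(q: int, iterations: int = 200) -> float:
    """Solve rho / H(rho) = 1 / log2(q) for rho in (0, 1) by bisection.

    The equation is rewritten as f(rho) = H(rho) - rho * log2(q) = 0.
    f is concave with f(0) = 0 and f(1) < 0, so it has a single root
    in the open interval whenever it is positive near zero.
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    log_q = math.log2(q)

    def f(rho: float) -> float:
        return binary_entropy(rho) - rho * log_q

    lo, hi = 1e-12, 1.0
    if f(lo) <= 0.0:
        # Only happens for astronomically large q (root below 1e-12).
        raise ValueError("q is too large for this bracketing interval")
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2.0


def cost_star(alpha: float, beta: float, n: float, c: float,
              rho: float, cap: float) -> float:
    """Evaluate alpha * C + beta * N * rho / cap(q, rho).

    `cap` is the capacity value for the chosen (q, rho); computing it
    is outside the scope of this sketch.
    """
    if cap <= 0.0:
        raise ValueError("cap must be positive")
    return alpha * c + beta * n * rho / cap


if __name__ == "__main__":
    for q in (2, 4, 8, 16):
        print(q, rho_star(q))
```

For $q = 4$ the equation reduces to $H(\rho) = 2\rho$, which $\rho = \frac{1}{2}$ satisfies exactly, so the sketch should return a value very close to 0.5 in that case. Bisection is used only because it is simple and robust here; any scalar root finder would serve.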

Abstract

Storing data in DNA is being explored as an efficient solution for archiving and in-object storage. Synthesis time and cost remain challenging, significantly limiting some applications at this stage. In this paper we investigate efficient synthesis, as it relates to cyclic synchronized synthesis technologies, such as photolithography. We define performance metrics related to the number of cycles needed for the synthesis of any fixed number of bits. We first expand on some results from the literature related to the channel capacity, addressing densities beyond those covered by prior work. This leads us to develop effective encoding achieving rate and capacity that are higher than previously reported. Finally, we analyze cost based on a parametric definition and determine some bounds and asymptotics. We investigate alphabet sizes that can be larger than 4, both for theoretical completeness and since practical approaches to such schemes were recently suggested and tested in the literature.

Paper Structure

This paper contains 12 sections, 8 theorems, 26 equations, 4 figures.

Key Result

Lemma 1

For any $q,C,L \in \mathbb{N}, (L\leq C)$ it holds that where the inequality is strict for $2 \leq q < C, 2 \leq C, 0 < L$.

Figures (4)

  • Figure 1: $\mathsf{cap}(q,\rho)$ as a function of $q$ and $\rho$.
  • Figure 2: Information rates achievable by the lookup-table construction, compared to the related capacities $\mathsf{cap}(q,\rho)$, for selected values of $q$. The dashed line represents the maximum achievable rate of the encoder, while the solid line represents the capacity. The value of $d$ for every pair $(q,\rho)$ was chosen to be the largest integer such that $B\leq 32$.
  • Figure 3: Information rates achievable by the Multi-size Alphabet Encoder construction, compared to the related capacities $\mathsf{cap}(q,\rho)$, for selected values of $q$ and $\rho \in [0,1]$. Dashed lines represent the maximum achievable rate of the Multi-size Alphabet Encoder, the x markers represent the rate achievable by the Knuth-style encoder, and solid lines represent capacities.
  • Figure 4: Values of $\frac{2}{q+1}$ and $\rho^*$, as a function of $q$. The space between the lines is the interval of interest, as proven in the interval-of-interest theorem.

Theorems & Definitions (22)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Remark
  • Claim 1
  • proof
  • Corollary 1
  • ...and 12 more