DNA Storage in the Short Molecule Regime

Ran Tamir, Nir Weinberger, Albert Guillén i Fàbregas

TL;DR

This work resolves the conjectured scaling of the amount of information that can be reliably stored in DNA storage with short molecules. For molecule length $L=\beta\log M$ with $\beta\in(0,1/\log|\mathcal{A}|)$, it proves that the largest reliable codebook grows as $|\mathcal{C}_M| \approx \exp\left\{ \frac{1-\beta\log|\mathcal{A}|}{2}\, M^{\beta\log|\mathcal{A}|} \log M \right\}$, matching a converse bound across the entire regime. The paper gives a direct random-coding scheme in which codewords are PMFs drawn from a Dirichlet$(1,\ldots,1)$ distribution and quantized to integer counts, with decoding based on KL divergence. It also introduces a low-complexity partition coding method that deterministically achieves the same scaling for a broad range of $\beta$, excluding very short molecules. Compared with prior works that relied on Poissonization and memoryless reductions, the achievability proof is shorter and more transparent. Together, these results sharpen the understanding of DNA-based storage limits in the short-molecule regime and provide implementable coding strategies with provable performance guarantees. They are relevant to scalable, high-density archival storage using DNA, where molecule lengths must be kept short while still achieving meaningful storage rates.
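To get a feel for the scaling above, the leading-order exponent can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function name `log_codebook_size` is hypothetical, and it assumes that $\log$ denotes the natural logarithm (consistent with the $\exp\{\cdot\}$ in the formula) and that lower-order terms are ignored.

```python
import math


def log_codebook_size(M, beta, alphabet_size=4):
    """Leading-order value of ln|C_M| from the TL;DR scaling.

    Assumptions (not from the paper's text): natural logarithms,
    lower-order terms dropped, alphabet_size = |A| (4 for DNA).
    """
    log_a = math.log(alphabet_size)
    if not 0 < beta < 1 / log_a:
        raise ValueError("beta must lie in (0, 1/log|A|)")
    exponent = beta * log_a  # M**exponent equals |A|**L when L = beta*log(M)
    return 0.5 * (1 - exponent) * M**exponent * math.log(M)


# Example: |A| = 4 and beta chosen so that beta*log|A| = 1/2.
M = 10_000
beta = 0.5 / math.log(4)
print(log_codebook_size(M, beta))  # equals 25 * ln(10000)
```

Note that $M^{\beta\log|\mathcal{A}|} = |\mathcal{A}|^{L}$ is the number of distinct molecule types of length $L$, so the exponent grows with the number of available types times $\log M$.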

Abstract

We study the amount of reliable information that can be stored in a DNA-based storage system composed of short DNA molecules. In this regime, Shomorony and Heckel (2022) put forward a conjecture on the scaling of the number of information bits that can be reliably stored. In this paper, we complete the proof of this conjecture. We analyze a random-coding scheme in which each codeword is obtained by quantizing a randomly generated probability mass function drawn from the probability simplex. By analyzing the optimal maximum-likelihood decoder, we derive an achievability bound that matches a recently established converse bound across the entire short-molecule regime. We also propose a second coding scheme, which operates with significantly lower computational complexity but achieves the optimal scaling, except for a specific range of very short molecules.
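The random-coding idea described in the abstract can be illustrated with a small sketch. This is an illustration under stated assumptions, not the paper's construction: the largest-remainder rounding rule, the function names, and the model of reads as draws with replacement are all assumptions. Under that sampling model, minimizing the KL divergence from the empirical read distribution to a codeword's PMF is equivalent to maximizing the multinomial likelihood.

```python
import math
import random


def sample_dirichlet_uniform(k, rng):
    """Draw a PMF from Dirichlet(1,...,1) by normalizing i.i.d. Exp(1) variates."""
    g = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]


def quantize_to_counts(pmf, m):
    """Round m*pmf to integer counts summing exactly to m.

    Largest-remainder rounding is an assumed rule, chosen for simplicity.
    """
    scaled = [p * m for p in pmf]
    counts = [math.floor(x) for x in scaled]
    shortfall = m - sum(counts)
    # Give the leftover molecules to the types with the largest remainders.
    order = sorted(range(len(pmf)), key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:shortfall]:
        counts[i] += 1
    return counts


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def decode(observed_counts, codebook):
    """Return the index of the codeword closest in KL to the empirical PMF."""
    n = sum(observed_counts)
    empirical = [c / n for c in observed_counts]

    def score(idx):
        total = sum(codebook[idx])
        return kl_divergence(empirical, [c / total for c in codebook[idx]])

    return min(range(len(codebook)), key=score)


# Toy usage: 16 molecule types, 1000 molecules per codeword, 8 codewords.
rng = random.Random(0)
codebook = [quantize_to_counts(sample_dirichlet_uniform(16, rng), 1000) for _ in range(8)]
reads = rng.choices(range(16), weights=codebook[3], k=5000)  # sampling with replacement
observed = [reads.count(t) for t in range(16)]
print(decode(observed, codebook))
```

Each codeword here is simply a vector of copy counts per molecule type, which reflects the fact that the shuffling channel preserves only how many copies of each type were stored.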

Paper Structure

This paper contains 12 sections, 6 theorems, 98 equations, 3 figures.

Key Result

Theorem 1

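Consider an error-free shuffling-sampling channel with $\beta \in (0,\frac{1}{\log|{\cal A}|})$ and a coverage depth $\xi > 0$. There exists a sequence of codes $\{{\cal C}_M\}_{M \geq 1}$ with vanishing error probabilities ($\varepsilon_M\to 0$), such that the codebook size attains the scaling summarized in the TL;DR: $|{\cal C}_M| \approx \exp\left\{ \frac{1-\beta\log|{\cal A}|}{2}\, M^{\beta\log|{\cal A}|} \log M \right\}$.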

Figures (3)

  • Figure 1: A description of the partition code structure; the set $\{1,2,\ldots,n_{\hbox{\tiny eff}}\}$ of molecule types is partitioned into $\left\lfloor n^{\rho} \right\rfloor$ equal-size subsets of types of molecules. In each subset, each type of molecule has the same number of copies.
  • Figure 2: The counts $\{U_{\boldsymbol{y}}(1),\ldots,U_{\boldsymbol{y}}(n_{\hbox{\tiny eff}})\}$ of all $n_{\hbox{\tiny eff}}$ molecule types, as placed in the order of the encoded partition. Random fluctuations in the count numbers are usually sufficiently low, such that each molecule type is decoded in the correct subset. In the presented case, the count number of a molecule type from ${\cal S}_2$ is relatively low, and the count number of a molecule from ${\cal S}_3$ is relatively high; due to these two large deviations, the decoded partition is incorrect.
  • Figure 3: Comparison between the leading factors of the random-coding and partition-coding density bounds (the equations labeled RC_density and LC_density in the paper) as functions of $\beta$ for $|{\cal A}|=2$ and three $\rho$ values ('RC' and 'PC' stand for random coding and partition coding, respectively).
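The partition code structure described in Figures 1 and 2 can be sketched as follows. This is a toy illustration under loud assumptions: the copy levels per subset, the sort-and-cut decoder, and the function names are hypothetical and are not taken from the paper. It only shows why small fluctuations in the counts leave the decoded partition intact while a large deviation corrupts it.

```python
def encode_partition(assignment, copy_levels):
    """Map each molecule type to a copy count determined by its subset.

    assignment[t] is the subset index of type t; copy_levels[s] is the
    (assumed, strictly increasing) number of copies for every type in subset s.
    """
    return [copy_levels[s] for s in assignment]


def decode_partition(observed_counts, num_subsets):
    """Assumed decoder: sort types by observed count, cut into equal-size groups."""
    n_types = len(observed_counts)
    size = n_types // num_subsets
    order = sorted(range(n_types), key=lambda t: observed_counts[t])
    decoded = [0] * n_types
    for rank, t in enumerate(order):
        decoded[t] = min(rank // size, num_subsets - 1)
    return decoded


# Six molecule types split into three equal-size subsets.
assignment = [0, 2, 1, 1, 0, 2]
stored = encode_partition(assignment, copy_levels=[10, 20, 30])

# Small fluctuations: every type stays within its subset's range of counts.
mild = [12, 27, 19, 22, 9, 33]
print(decode_partition(mild, 3) == assignment)    # True

# One large deviation (type 2 drops to 8 copies) breaks the decoded partition.
severe = [12, 27, 8, 22, 9, 33]
print(decode_partition(severe, 3) == assignment)  # False
```

The information is carried by which types fall into which subset, so decoding only needs a sort of the observed counts, in contrast to a search over a random codebook.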

Theorems & Definitions (6)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Corollary 1
  • Proposition 2
  • Lemma 1