DNA Storage in the Short Molecule Regime
Ran Tamir, Nir Weinberger, Albert Guillén i Fàbregas
TL;DR
This work resolves the conjectured scaling of the amount of information that can be reliably stored in a DNA storage system with short molecules. For molecule length $L=\beta\log M$ with $\beta\in(0,1/\log|\mathcal{A}|)$, it proves that the largest reliable codebook grows as $|\mathcal{C}_M| \approx \exp\left\{ \left(\frac{1-\beta\log|\mathcal{A}|}{2}\right) M^{\beta\log|\mathcal{A}|} \log M \right\}$, matching a converse bound across the entire regime. The achievability proof uses a direct random-coding scheme: each codeword is a PMF drawn from a Dirichlet$(1,\ldots,1)$ distribution and quantized to integer counts, and decoding is based on KL divergence. The paper also introduces a low-complexity partition coding method that deterministically achieves the same scaling for a broad range of $\beta$, excluding very short molecules. Compared with prior works that relied on Poissonization and reductions to memoryless channels, the resulting achievability proof is shorter and more transparent. Together, these results settle the storage limits of DNA-based systems in the short-molecule regime and provide implementable coding strategies with provable performance guarantees, which is relevant to high-density archival storage where molecules must be kept short.
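To get a feel for the scaling stated above, the following minimal sketch evaluates the exponent $\frac{1-\beta\log|\mathcal{A}|}{2}\, M^{\beta\log|\mathcal{A}|}\log M$ numerically. It assumes natural logarithms throughout (the summary does not fix the base), and the function names `num_molecule_types` and `log_codebook_size` are illustrative, not taken from the paper. With $L=\beta\log M$ and logarithms in a common base, $M^{\beta\log|\mathcal{A}|}=|\mathcal{A}|^{L}$, so the middle factor is the number of distinct molecules of length $L$.

```python
import math


def _exponent(beta, alphabet_size):
    """Return beta * log|A| and check it lies in the short-molecule regime (0, 1)."""
    e = beta * math.log(alphabet_size)  # natural log: an assumption of this sketch
    if not 0.0 < e < 1.0:
        raise ValueError("beta must lie in (0, 1/log|A|)")
    return e


def num_molecule_types(M, beta, alphabet_size=4):
    """Number of distinct length-L molecules, |A|^L = M^(beta*log|A|), for L = beta*log M."""
    return M ** _exponent(beta, alphabet_size)


def log_codebook_size(M, beta, alphabet_size=4):
    """Leading-order value of log|C_M| (in nats) from the scaling law in the TL;DR."""
    if M <= 1:
        raise ValueError("M must exceed 1")
    e = _exponent(beta, alphabet_size)
    return 0.5 * (1.0 - e) * (M ** e) * math.log(M)


if __name__ == "__main__":
    for M in (10**4, 10**6, 10**8):
        nats = log_codebook_size(M, beta=0.5)
        print(f"M={M:>10d}  log|C_M| ~ {nats:.3e} nats")
```

Because the exponent $\beta\log|\mathcal{A}|$ is below 1 in this regime, the number of storable nats grows sublinearly in the number of molecules $M$.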
Abstract
We study the amount of reliable information that can be stored in a DNA-based storage system composed of short DNA molecules. In this regime, Shomorony and Heckel (2022) put forward a conjecture on the scaling of the number of information bits that can be reliably stored. In this paper, we complete the proof of this conjecture. We analyze a random-coding scheme in which each codeword is obtained by quantizing a randomly generated probability mass function drawn from the probability simplex. By analyzing the optimal maximum-likelihood decoder, we derive an achievability bound that matches a recently established converse bound across the entire short-molecule regime. We also propose a second coding scheme, which operates with significantly lower computational complexity and still achieves the optimal scaling, except for a specific range of very short molecules.
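The random-coding idea can be illustrated with a small toy simulation. This is a sketch under stated assumptions, not the paper's construction or proof. Each codeword is a PMF over molecule types drawn uniformly from the simplex (Dirichlet$(1,\ldots,1)$, generated here by normalizing i.i.d. Exp(1) variates) and quantized to integer counts summing to $M$. The largest-remainder rounding rule, the noiseless sampling-with-replacement read model, and all function names are assumptions made for illustration. Under that read model, maximum-likelihood decoding is equivalent to minimizing the KL divergence $D(q\,\|\,p)$ from the empirical read distribution $q$ to the codeword PMF $p$.

```python
import math
import random


def draw_uniform_pmf(k, rng):
    """Draw a PMF from Dirichlet(1,...,1) by normalizing i.i.d. Exp(1) variates."""
    g = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]


def quantize_to_counts(pmf, m):
    """Round m*pmf to nonnegative integers summing to m (largest-remainder rule, assumed)."""
    scaled = [m * p for p in pmf]
    counts = [math.floor(s) for s in scaled]
    shortfall = m - sum(counts)
    # Give the leftover units to the entries with the largest fractional parts.
    order = sorted(range(len(pmf)), key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:shortfall]:
        counts[i] += 1
    return counts


def kl_divergence(q, p):
    """D(q || p) in nats; infinite if q puts mass where p has none."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0:
            if pi == 0:
                return math.inf
            total += qi * math.log(qi / pi)
    return total


def decode(read_counts, codebook):
    """Return the index of the codeword whose PMF is closest in KL to the reads."""
    n = sum(read_counts)
    q = [c / n for c in read_counts]

    def score(j):
        m = sum(codebook[j])
        return kl_divergence(q, [c / m for c in codebook[j]])

    return min(range(len(codebook)), key=score)


def demo(seed=0, k=16, m=2000, num_codewords=8, num_reads=2000):
    """Store one codeword, read molecules with replacement, and decode (toy sizes)."""
    rng = random.Random(seed)
    codebook = [quantize_to_counts(draw_uniform_pmf(k, rng), m) for _ in range(num_codewords)]
    sent = rng.randrange(num_codewords)
    reads = rng.choices(range(k), weights=codebook[sent], k=num_reads)
    read_counts = [reads.count(t) for t in range(k)]
    return sent, decode(read_counts, codebook)


if __name__ == "__main__":
    sent, decoded = demo()
    print("sent:", sent, "decoded:", decoded)
```

The toy sizes (16 molecule types, 8 codewords) are far from the asymptotic regime analyzed in the paper; the sketch only shows the mechanics of quantized-PMF codewords and divergence-based decoding.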
