Permutation Recovery Problem against Deletion Errors for DNA Data Storage
Shubhransh Singhvi, Charchit Gupta, Avital Boruchovsky, Yuval Goldberg, Han Mao Kiah, Eitan Yaakobi
TL;DR
This work addresses permutation recovery in unordered DNA data storage under deletion noise. It formalizes an N-permutation model with addresses and random data, and analyzes identifiability under a binary deletion channel, deriving explicit thresholds L_Th and N_Th that guarantee a unique true permutation with high probability. It also shows that a prior approach relying only on addresses can fail for large address spaces, motivating a combined clustering-and-labeling strategy. The proposed permutation recovery algorithm achieves high success with sub-quadratic data comparisons and provides a concrete bound on the expected number of comparisons, offering practical improvements for robust DNA storage systems.
Abstract
Owing to its immense storage density and durability, DNA has emerged as a promising storage medium. However, due to technological constraints, data can only be written onto many short DNA molecules, called data blocks, that are stored in an unordered way. To handle the unordered nature of DNA data storage systems, a unique address is typically prepended to each data block to form a DNA strand. DNA storage systems are also prone to errors and generate multiple noisy copies of each strand, called DNA reads. We therefore study the permutation recovery problem against deletion errors for DNA data storage. This problem requires one to reconstruct the addresses or, in other words, to uniquely identify the noisy reads. By successfully reconstructing the addresses, one can determine the correct order of the data blocks, effectively solving the clustering problem. We first show that, under certain mild assumptions, all the noisy reads can be identified almost surely. We then propose a permutation recovery procedure and analyze its complexity.
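To make the setting concrete, the following is a minimal simulation sketch of the storage model described above, not the recovery procedure proposed in the paper. It assumes a binary alphabet, fixed-width binary addresses, an independent per-symbol deletion probability `p`, and one read per strand; all function and parameter names (`make_strands`, `deletion_channel`, `noisy_reads`, `is_subsequence`) are hypothetical and chosen only for illustration.

```python
import random


def make_strands(num_strands, addr_len, data_len, rng):
    """Build strands as a unique binary address followed by random binary data."""
    if num_strands > 2 ** addr_len:
        raise ValueError("addr_len is too short for unique addresses")
    strands = []
    for i in range(num_strands):
        address = format(i, f"0{addr_len}b")  # fixed-width unique address
        data = "".join(rng.choice("01") for _ in range(data_len))
        strands.append(address + data)
    return strands


def deletion_channel(strand, p, rng):
    """Delete each symbol independently with probability p (assumed channel)."""
    return "".join(ch for ch in strand if rng.random() >= p)


def noisy_reads(strands, p, rng):
    """Return one noisy read per strand, shuffled to mimic unordered storage.

    Each item is (true_index, read); the index is kept only so that a
    recovery attempt can be checked against the ground truth.
    """
    reads = [(i, deletion_channel(s, p, rng)) for i, s in enumerate(strands)]
    rng.shuffle(reads)
    return reads


def is_subsequence(read, strand):
    """Check that read can be obtained from strand by deletions only."""
    it = iter(strand)
    return all(ch in it for ch in read)
```

Under deletions alone, every read is a subsequence of the strand that produced it. The identification question is whether it is also a subsequence of some other strand; `is_subsequence` gives a simple consistency check for that in small experiments.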
