Collision Aware Data Allocation In Multi-tube DNA Storage

Yixun Wei, Bingzhe Li, David Du

TL;DR

This paper proposes a collision-aware data allocation scheme that places data with different collisions into different tubes, so that a primer banned in one tube because of a primer-payload collision can be reused in other tubes, thus enhancing the overall storage capacity.

Abstract

DNA storage is a promising archival data storage solution to today's big data problem. A DNA storage system encodes and stores digital data as synthetic DNA sequences and decodes the DNA sequences back to digital data via sequencing. For efficient retrieval of target data, existing Polymerase Chain Reaction (PCR) based DNA storage systems use primers as specific identifiers to tag different sets of DNA strands. However, if a primer collides with any payload in the same DNA tube, the primer cannot safely serve as an identifier and must be disabled in that tube. In a DNA storage system with multiple DNA tubes, primer-payload collisions can spread over all DNA tubes, repeatedly disable many primers, and cause a significant reduction in overall capacity. This paper proposes a collision-aware data allocation scheme that places data with different collisions into different tubes, so that a primer banned in one tube because of a primer-payload collision can be reused in other tubes. This allocation increases the number of usable primers over all tubes and thus enhances the overall storage capacity. The running time of our scheme is $O(n^2)$ in the number of digital data chunks. The scheme serves as a pre-processing method for any DNA storage system. An evaluation on a state-of-the-art encoding scheme shows that the allocation scheme can increase overall storage capacity by 20%-25%.
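
To make the allocation idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the set of primers each data chunk collides with is already known, and it uses a simple greedy rule: put each chunk into the tube where it bans the fewest additional primers. The names `allocate_chunks`, `usable_primers`, `chunk_collisions`, and `tube_capacity` are hypothetical and introduced only for illustration. The sketch does not reproduce the paper's clustering procedure or its $O(n^2)$ running time.

```python
from typing import Dict, List, Set


def allocate_chunks(
    chunk_collisions: Dict[str, Set[str]],
    num_tubes: int,
    tube_capacity: int,
) -> List[List[str]]:
    """Greedy sketch (an assumption, not the paper's method): place each
    chunk in the tube where it would ban the fewest new primers."""
    tubes: List[List[str]] = [[] for _ in range(num_tubes)]
    banned: List[Set[str]] = [set() for _ in range(num_tubes)]

    # Handle chunks with many collisions first; break ties by name so the
    # result is deterministic.
    order = sorted(chunk_collisions, key=lambda c: (-len(chunk_collisions[c]), c))

    for chunk in order:
        candidates = [t for t in range(num_tubes) if len(tubes[t]) < tube_capacity]
        if not candidates:
            raise ValueError("not enough tube capacity for all chunks")
        # Cost of a tube = primers this chunk would newly ban there;
        # ties go to the emptier tube, then to the lower tube index.
        best = min(
            candidates,
            key=lambda t: (len(chunk_collisions[chunk] - banned[t]), len(tubes[t]), t),
        )
        tubes[best].append(chunk)
        banned[best] |= chunk_collisions[chunk]
    return tubes


def usable_primers(
    tubes: List[List[str]],
    chunk_collisions: Dict[str, Set[str]],
    primer_library: Set[str],
) -> List[int]:
    """Count, per tube, the primers that collide with no chunk in that tube."""
    counts = []
    for tube in tubes:
        tube_banned = set().union(*(chunk_collisions[c] for c in tube))
        counts.append(len(primer_library - tube_banned))
    return counts


# Toy data (made up for illustration): four primers, four chunks.
library = {"p1", "p2", "p3", "p4"}
collisions = {"a": {"p1"}, "b": {"p1"}, "c": {"p2"}, "d": {"p2"}}

aware = allocate_chunks(collisions, num_tubes=2, tube_capacity=2)
naive = [["a", "c"], ["b", "d"]]  # collision-unaware placement for comparison

print(aware, usable_primers(aware, collisions, library))
print(naive, usable_primers(naive, collisions, library))
```

In this toy case the greedy placement groups chunks that share a collided primer, so each tube loses one primer instead of two. The primer banned in one tube stays usable in the other, which is the effect the abstract describes.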

Paper Structure

This paper contains 13 sections and 5 figures.

Figures (5)

  • Figure 1: Workflow of a typical DNA storage system
  • Figure 2: Standard PCR and defective PCR with primer-payload collision
  • Figure 3: Procedure of Initial Clustering
  • Figure 4: Improvements from collision-aware data allocation in the average number of usable primers per tube and the average storage capacity per tube (five DNA tubes in total).
  • Figure 5: The trade-offs with different chunk sizes: (a) the average number of collided primers per chunk; (b) the average tube capacity when collision-aware data allocation is applied; (c) the average number of sequencings required to retrieve a file as the data chunk size changes