
A Bird-Eye view on DNA Storage Simulators

Sanket Doshi, Mihir Gohel, Manish K. Gupta

TL;DR

This paper surveys software tools for DNA data storage simulation, addressing the cost barriers that hinder real‑world testing and outlining a seven‑step workflow (encoding, synthesis, storage, sequencing, clustering, reconstruction, decoding). It reviews three domain‑focused simulators—Storalator, MESA, and DeepSimulator—detailing how each models distinct parts of the pipeline and highlighting their strengths and limitations, such as encoding/decoding absence, storage modeling, or domain‑specific realism. The authors discuss core concepts like error correction, clustering strategies, and DNN‑based reconstruction to handle contaminated clusters, and they examine practical considerations, including tool usability and scalability. They also point to future directions, notably JPEG DNA for image storage and broader standardization efforts, emphasizing ongoing opportunities to integrate encoding/decoding and cross‑domain noise modeling for more realistic, cost‑effective DNA data storage research.

Abstract

Given today's huge demand for storage, DNA-based storage solutions look promising because of their longevity, low power consumption, and high capacity. In practice, however, storing data in DNA is expensive and challenging. Researchers and developers have therefore built software that simulates real-life DNA storage without incurring that cost. This paper reviews some of the software that performs DNA storage simulations in different domains. It also explains core concepts such as synthesis, sequencing, clustering, reconstruction, GC window, and k-mer window, and gives an overview of existing algorithms. Finally, we present three software tools and compare them on the basis of domain, implementation techniques, and customer/commercial usability.

Paper Structure (48 sections, 19 figures, 2 tables)

This paper contains 48 sections, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Workflow of the entire DNA storage process. First, a source string (here a binary string) is broken into chunks of equal length. Encoding then happens in two stages: the binary data is mapped to quaternary (i.e., ACGT) format, and extra information is added for error correction using LDPC codes, Reed-Solomon codes, etc. Next, these DNA chunks (oligos) are written into synthetic DNA molecules using DNA synthesizers, and the molecules are kept in a storage medium. At read time, multiple copies of each DNA strand are generated using PCR amplification so that the necessary information can be retrieved easily. A clustering algorithm is then applied to group identical oligos, and finally reconstruction followed by decoding (the reverse of encoding) produces the original binary file.
  • Figure 2: Workflow of an ideal DNA storage simulator. It covers seven steps. First, it takes a source file as input, breaks it into equal-sized chunks, maps them to quaternary code words (i.e., ACGT), and adds redundancy for error correction. The simulator then adds IDS (insertion, deletion, substitution) errors of the kind that occur during actual synthesis. Storage and temperature effects also need to be thoroughly studied. After that, it mimics the PCR process in order to read the required files. Finally, clustering, reconstruction, and decoding steps produce the desired output file.
  • Figure 3: Block diagrams of the basic DNA storage processes covered by the Storalator, MESA, and DeepSimulator software, respectively.
  • Figure 4: Complete workflow of the Storalator software. It is divided into four parts: (1) SOLQC, or error characterization; (2) error simulation, covering synthesis, PCR, and sequencing; (3) clustering; and (4) reconstruction. Image inspired by the workshop on Non-Volatile Memory by Omer Sabary, Gadi Chaykin, Nili Furman, Dvir Ben Shabat, and Eitan Yaakobi, 2022. © Authors, reprinted with permission.
  • Figure 5: Algorithms available in the Storalator software. Note that each synthesis technology can only be used with particular sequencing technologies.
  • ...and 14 more figures
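
The first simulator steps described in the Figure 1 and Figure 2 captions (chunking, binary-to-quaternary mapping, and injection of IDS errors) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the fixed two-bits-per-base mapping, the function names `encode`, `decode`, and `ids_channel`, and the independent per-base error probabilities are hypothetical and are not taken from Storalator, MESA, or DeepSimulator. No error-correcting redundancy is added, and storage and PCR effects are not modeled.

```python
import random

# Assumed toy mapping: two bits per nucleotide (not any tool's actual code).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(bits: str, chunk_bases: int = 8) -> list[str]:
    """Map a binary string to ACGT and split it into oligos.

    Every oligo has chunk_bases bases except possibly the last one.
    """
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    bases = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
    return [bases[i:i + chunk_bases] for i in range(0, len(bases), chunk_bases)]


def decode(oligos: list[str]) -> str:
    """Reverse of encode for error-free oligos given in their original order."""
    return "".join(BASE_TO_BITS[base] for oligo in oligos for base in oligo)


def ids_channel(oligo: str, p_ins: float, p_del: float, p_sub: float,
                rng: random.Random) -> str:
    """Return a noisy copy of oligo with insertions, deletions, substitutions.

    Each base is deleted with probability p_del, otherwise substituted with
    probability p_sub; a random base is inserted after it with probability
    p_ins. Errors are independent per position (a simplifying assumption).
    """
    noisy = []
    for base in oligo:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is dropped
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
        if rng.random() < p_ins:
            noisy.append(rng.choice("ACGT"))  # insertion after this base
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same reads
    oligos = encode("0001101100011011", chunk_bases=4)
    reads = [ids_channel(o, p_ins=0.05, p_del=0.05, p_sub=0.05, rng=rng)
             for o in oligos]
    print(oligos, reads)
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which matters when comparing clustering or reconstruction algorithms on the same simulated reads. A realistic simulator would replace the uniform error probabilities with technology-specific error profiles.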