CMOSS: A Reliable, Motif-based Columnar Molecular Storage System

Eugenio Marinelli, Yiqing Yan, Virginie Magnone, Pascal Barbry, Raja Appuswamy

TL;DR

CMOSS tackles the error-prone and cost-inefficient nature of DNA data storage by introducing a motif-based vertical (columnar) layout that merges consensus calling with decoding and uses fixed-size extents for random access. The approach enables progressive, column-wise recovery and robust data reconstruction at low sequencing depth, supported by LDPC or Reed-Solomon (RS) codes and a scalable Oligo-Extent/Oligo-Block (OE/OB) addressing scheme. Through two wet-lab experiments and extensive simulations, CMOSS demonstrates reduced read/write costs, mitigated coverage bias, and competitive or superior performance compared with state-of-the-art DNA storage methods, while providing open data and pipelines for reproducibility. In practice, this makes long-term DNA archival storage more cost-effective and scalable, with a design that tolerates the sequencing errors and PCR biases inherent in random-access workflows.

Abstract

The surge in demand for cost-effective, durable long-term archival media, coupled with density limitations of contemporary magnetic media, has resulted in synthetic DNA emerging as a promising new alternative. Despite its benefits, storing data on DNA poses several challenges, as the technologies used for reading and writing data and for achieving random access on DNA are highly error-prone. To deal with such errors, it is important to design efficient pipelines that carefully use redundancy to mask errors without amplifying overall cost. In this work, we present Columnar MOlecular Storage System (CMOSS), a novel, end-to-end DNA storage pipeline that provides error-tolerant data storage at low read/write costs. CMOSS differs from the state of the art (SOTA) on three fronts: (i) a motif-based, vertical layout in contrast to the nucleotide-based horizontal layout used by SOTA, (ii) merged consensus calling and decoding enabled by the vertical layout, and (iii) a flexible, fixed-size, block-based data organization for random access over DNA storage in contrast to the variable-sized, object-based access used by SOTA. Using an in-depth evaluation via simulation studies and real wet-lab experiments, we demonstrate the benefits of various CMOSS design choices. We make the entire pipeline, together with the read datasets, openly available to the community for faithful reproduction and further research.

Paper Structure (19 sections, 11 figures, 3 tables)

This paper contains 19 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Example of consensus algorithm applied to a cluster of three strings in case of substitution errors only (a) and insertion/deletion errors (b)-(d).
  • Figure 2: The oligo structure for the object-based abstraction. UFP: universal forward primer, DBP: database primer, TBLP: table primer, URP: universal reverse primer.
  • Figure 3: The population fraction change (y-axis) and the number of oligos (x-axis) of each table in the SSB, TPCH and SYN databases. The purple '+' points overlap because the SYN database has 8 tables of uniform size and their population fraction changes are all close to 1.
  • Figure 4: Example comparing the mapping layout of the SOTA approach with that of CMOSS. Panels (a) and (b) are common to both SOTA and CMOSS. (a) The raw input data are grouped into 4 blocks, highlighted with different colors. (b) Each block in the example contains 12 bits and is encoded using an LDPC error-correcting code, which adds 6 parity bits to each block; the resulting 18-bit block is then split into 3 smaller chunks of 6 bits each (2 holding the 12 data bits and 1 holding the parity bits). (c) SOTA encodes each chunk as one oligo. As a result, each LDPC block is mapped to 3 oligos (shown in the same color in the picture). (d) In contrast, CMOSS maps each block using a motif-based approach. Every group of 3 bits maps to a motif (short oligo) of 5 nucleotides. As a result, every 6-bit chunk maps to two motifs, which are arranged vertically to form a column until the full LDPC block is encoded. Each LDPC block mapped into a column of motifs is appended to the previous one: for example, the blue column mapping the blue LDPC block is appended to the green column mapping the green LDPC block. Once the desired oligo length is reached (2 columns in the example), a new group of columns is started (in the picture, the pink and yellow columns). We refer to a column group as an Oligo-Block and to the set of Oligo-Blocks as an Oligo-Extent. This organization facilitates indexing, as every extent is identified by a pair of primers, while oligos across Oligo-Blocks are identified by indexes. For simplicity, the numbers reported in the figure are limited to this specific example. They are customizable; in the actual design, we use LDPC blocks containing 256000 bits, a motif length of 16 nt, and groups of 30 bits mapping to a motif.
  • Figure 5: Our CMOSS data writing pipeline (top) shows the binary-to-DNA encoding pathway with a small example. (1) Binary data are split into blocks, and every block is encoded using an LDPC error-control code. In the example, every colored line in (2) is an LDPC block. LDPC blocks are grouped together column-wise to form oligo-blocks; in the example, they are arranged vertically in groups of 4 bits. Finally, every line in the new layout is indexed (black bits in the figure), as shown in (3). Each column of an LDPC block is then mapped to nucleotides: groups of 4 bits in the LDPC block are converted into motifs of 4 nucleotides through an associative array that maps 4-bit values to motifs (4). The motifs are arranged such that every column encodes a different LDPC block. Finally, the two oligo-blocks are synthesized as actual DNA strands. Our CMOSS reading pipeline (bottom) shows the DNA-to-binary decoding pathway. For simplicity, the example assumes a sequencing coverage of 1x. The first step in decoding is sequencing (5). Then, a clustering algorithm is applied to the noisy reads (6). As the example assumes 1x coverage, every cluster has only one read. Within each cluster, we apply consensus only to the first motifs (black nucleotides in the figure) in order to retrieve the indexes. Using the indexes, the clusters are further separated into their corresponding oligo-blocks (7). Then, each oligo-block enters a loop in which each iteration applies a set of operations to one of its columns of motifs. First, a motif-based consensus is applied to the first column of motifs (the blue nucleotides) for every cluster. As this example has one noisy read per cluster, the motif consensus for each cluster is simply the first motif of the read itself. Because of a deletion error in the reads of cluster 1 and cluster 2, two motifs will contain a wrong nucleotide, as highlighted by the green nucleotides in (8). Motifs are decoded into bits (9) and error-corrected (10). The decoded bits from (10) are sent back to the LDPC encoder (11), where the corrected bits are re-encoded as in steps (1)-(2). Then, the encoded bits are fed to the motif encoder, as in step (4). The corrected motifs in (13) are used to align the reads in their corresponding clusters. This realignment (14) stops errors from propagating to the following motifs. In the example, the realignment lets us spot two deletions that we would otherwise be unable to identify and shift the subsequent nucleotides (the green nucleotides) accordingly. The whole process, starting from motif consensus, is then repeated for the next column of motifs (green nucleotides) until all columns in the oligo-block are processed.
  • ...and 6 more figures
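
The vertical layout described in the Figure 4 caption can be illustrated with a small sketch. It uses the toy parameters of that example (3 bits per motif, 5-nucleotide motifs, 2 columns per oligo-block) and treats each input block as an already LDPC-encoded bit string, so no error-correcting code is implemented here. The motif table and the function names `block_to_column`, `vertical_layout`, and `read_column` are hypothetical choices made for illustration; they are not taken from the CMOSS pipeline.

```python
BITS_PER_MOTIF = 3           # toy value from the Figure 4 example
MOTIF_LEN = 5                # toy value from the Figure 4 example
COLUMNS_PER_OLIGO_BLOCK = 2  # toy value from the Figure 4 example

# Hypothetical 3-bit -> 5-nt motif table (NOT the motif set used by CMOSS).
MOTIFS = ["ACGTA", "CATGC", "GTACG", "TGCAT",
          "AGCTC", "CTAGA", "GATCT", "TCGAG"]
MOTIF_OF = {format(i, "03b"): m for i, m in enumerate(MOTIFS)}
BITS_OF = {m: b for b, m in MOTIF_OF.items()}


def block_to_column(block_bits: str) -> list[str]:
    """Map one already-encoded block (a bit string) to a column of motifs."""
    if len(block_bits) % BITS_PER_MOTIF:
        raise ValueError("block length must be a multiple of BITS_PER_MOTIF")
    return [MOTIF_OF[block_bits[i:i + BITS_PER_MOTIF]]
            for i in range(0, len(block_bits), BITS_PER_MOTIF)]


def vertical_layout(blocks: list[str]) -> list[list[str]]:
    """Return oligo-blocks; each one is a list of oligos (rows)."""
    columns = [block_to_column(b) for b in blocks]
    oligo_blocks = []
    for start in range(0, len(columns), COLUMNS_PER_OLIGO_BLOCK):
        group = columns[start:start + COLUMNS_PER_OLIGO_BLOCK]
        # Row r of an oligo-block concatenates motif r of every column,
        # so each encoded block is spread over all oligos of the group.
        oligo_blocks.append(["".join(row) for row in zip(*group)])
    return oligo_blocks


def read_column(oligos: list[str], col: int) -> str:
    """Recover the bits of one column from error-free oligos."""
    start = col * MOTIF_LEN
    return "".join(BITS_OF[o[start:start + MOTIF_LEN]] for o in oligos)


# Four 18-bit blocks (12 data bits + 6 parity bits in the caption's example).
blocks = ["000001010011100101", "111110101100011010",
          "000000000000000000", "111111111111111111"]
oligo_blocks = vertical_layout(blocks)
```

With these inputs, the four blocks yield two oligo-blocks of six oligos each, and every oligo is 10 nt long (two motifs). Because each column holds exactly one encoded block, `read_column` can recover a block without touching the other columns, which is the property that the column-wise decoding loop in the Figure 5 caption relies on.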