High Information Density and Low Coverage Data Storage in DNA with Efficient Channel Coding Schemes

Yi Ding; Xuan He; Tuan Thanh Nguyen; Wentu Song; Zohar Yakhini; Eitan Yaakobi; Linqiang Pan; Xiaohu Tang; Kui Cai

TL;DR

This work reports a DNA-based data storage architecture that incorporates efficient channel coding schemes, including different types of error-correcting codes (ECCs) and constrained codes, for both the inner and outer coding of the DNA data storage channel.

Abstract

DNA-based data storage has been attracting significant attention due to its extremely high data storage density, low power consumption, and long-term durability compared to conventional data storage media. Despite recent advances in DNA data storage technology, significant challenges remain. In particular, various types of errors can occur during DNA synthesis, storage, and sequencing, including substitution, insertion, and deletion errors. Furthermore, an entire oligo may be lost. In this work, we report a DNA-based data storage architecture that incorporates efficient channel coding schemes, including different types of error-correcting codes (ECCs) and constrained codes, for both the inner and outer coding of the DNA data storage channel. We also carried out large-scale experiments to validate the proposed architecture. Specifically, 1.61 and 1.69 MB of data were encoded into 30,000 oligos each, with information densities of 1.731 and 1.815, respectively. The stored information was fully recovered without any error at average coverages of 4.5 and 6.0, respectively. This experiment achieved the highest net information density and lowest coverage among existing DNA-based data storage experiments (with standard DNA), with data recovery rates and coverage approaching theoretical optima.

Paper Structure (12 sections, 8 figures, 4 tables)

Introduction
Results
Encoding Schemes
Experiment Design
Raw Data Analysis
Data Recovery
Discussion
Methods
Modified-SRT
Single edit reconstruction code
Modified-R10 Codes
Modified-BFA

Figures (8)

Figure 1: The workflow of the designed DNA-based data storage system with the structure of an oligo. In the experiment, 30,000 oligos of length 296 nucleotides were synthesized for each of two coding schemes. In Coding Scheme 1, each oligo has 226 nucleotides as data payload, 8 nucleotides as seed, 14 nucleotides as redundancy to remove homopolymer runs longer than 4 and to detect or correct errors at the DNA base level, 2 nucleotides as the coding scheme indicator, and 46 nucleotides as primers. In Coding Scheme 2, each oligo has 233 nucleotides as data payload, 8 nucleotides as seed, 7 nucleotides as redundancy to remove homopolymer runs longer than 4 and to detect or correct errors at the DNA base level, 2 nucleotides as the coding scheme indicator, and 46 nucleotides as primers. After DNA synthesis and storage, Illumina sequencing was applied to the DNA pool to obtain 24 million sequenced oligos. The source data was recovered after the decoding processes.
Figure 2: Encoding process of the designed DNA-based data storage architecture. For illustration purposes, we consider an input binary file of 30 bits, partitioned into 5 segments of 6 bits each. The seed is assumed to be 2 bits long. See Sections 4.1 to 4.3 for full details of each encoding step.
Figure 3: Read number distribution of all the oligos.
Figure 4: Read number distribution at a coverage of 5x.
Figure 5: The workflow of the downsampling and data recovery experiment. After downsampling, a length check was conducted to discard oligos shorter than 249 nts or longer than 251 nts. Next, the oligos were clustered based on their seeds. Sequence reconstruction was performed on the sequences sharing the same seed. Thereafter, constrained code decoding was applied to each sequence, followed by outer ECC decoding to recover the originally stored user data.
...and 3 more figures
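
The oligo layout in the Figure 1 caption and the first two recovery steps in the Figure 5 caption (length check, then clustering by seed) can be illustrated with a minimal Python sketch. The field lengths and the 249 to 251 nt window are taken from the captions. Everything else is an assumption made for illustration: the names `SCHEME_1`, `SCHEME_2`, `passes_length_check`, and `cluster_by_seed` are hypothetical, and the sketch assumes the seed sits in the first 8 nucleotides of a primer-trimmed read, which the captions do not state. Sequence reconstruction and decoding are not shown.

```python
from collections import defaultdict

# Field lengths in nucleotides, as listed in the Figure 1 caption.
SCHEME_1 = {"payload": 226, "seed": 8, "redundancy": 14, "indicator": 2, "primers": 46}
SCHEME_2 = {"payload": 233, "seed": 8, "redundancy": 7, "indicator": 2, "primers": 46}
OLIGO_LENGTH = 296  # synthesized oligo length from the Figure 1 caption


def total_length(layout):
    """Sum the field lengths of one oligo layout."""
    return sum(layout.values())


def passes_length_check(read, low=249, high=251):
    """Keep reads of 249 to 251 nt, the window given in the Figure 5 caption."""
    return low <= len(read) <= high


def cluster_by_seed(reads, seed_length=8):
    """Group length-checked reads by their seed.

    ASSUMPTION: the seed is the first `seed_length` nucleotides of each
    primer-trimmed read. The source does not give the seed position.
    """
    clusters = defaultdict(list)
    for read in reads:
        if passes_length_check(read):
            clusters[read[:seed_length]].append(read)
    return dict(clusters)


if __name__ == "__main__":
    # Both layouts should add up to the synthesized oligo length.
    print(total_length(SCHEME_1), total_length(SCHEME_2))

    # Toy reads (not experimental data): two share a seed, one is too short.
    reads = [
        "ACGTACGT" + "A" * 242,  # 250 nt, kept
        "ACGTACGT" + "C" * 241,  # 249 nt, kept
        "ACGTACGT" + "G" * 240,  # 248 nt, discarded by the length check
        "TTGGCCAA" + "T" * 243,  # 251 nt, kept, different seed
    ]
    for seed, members in cluster_by_seed(reads).items():
        print(seed, len(members))
```

Both layouts sum to 296 nucleotides, and 296 minus the 46 primer nucleotides gives 250, the centre of the length window. This suggests, though the captions do not say so explicitly, that the length check is applied to reads with the primers removed.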
