DNA-MGC+: A versatile codec for reliable and resource-efficient data storage on synthetic DNA

Ramy Khabbaz; Jérémy Mateos; Marc Antonini; Serge Kas Hanna

Abstract

The biochemical processes underlying DNA data storage, including synthesis, amplification, and sequencing, are inherently noisy. Consequently, base-level insertion, deletion, and substitution (IDS) errors, as well as sequence-level dropouts, occur and pose major challenges for reliable data retrieval. Here we introduce DNA-MGC+, a DNA storage codec designed to enable reliable and resource-efficient data retrieval under diverse operating conditions. We evaluate DNA-MGC+ across a wide range of in silico and in vitro settings, including experiments with both Illumina and Nanopore sequencing, and show that it consistently outperforms existing codecs. In particular, DNA-MGC+ achieves simultaneous gains in sequencing depth requirements, read cost, decoding time, storage density, and error-correction capability under explicit reliability constraints. Notable results include reliable decoding under IDS error rates of up to 24% in synthetic scenarios, and reliable retrieval at sequencing depths below 3x with read costs below 3.5 bits/nt under electrochemical synthesis for both Illumina and Nanopore sequencing.

Paper Structure (8 sections, 10 equations, 9 figures, 13 tables)

Encoding.
Filtering.
Decoding.
Stored file and codec configurations.
Oligonucleotide design and synthesis.
Constraint-based filtering.
PCR, library preparation, and sequencing.
Read processing and decoding.

Figures (9)

Figure 1: In silico performance of DNA-MGC+ and comparison codecs under synthetic error and bias models. (a) Schematic of the evaluation pipeline, in which a random 15-KB file is encoded into DNA sequences and decoded after processing through a synthetic channel modeling the end-to-end DNA storage process, parameterized by bias, coverage depth, and error rate. (b) Coverage distributions for the three considered bias regimes, based on a lognormal model with unit mean and standard deviations $\sigma = 0$ (no bias), $\sigma = 0.5$ (moderate bias), and $\sigma = 1$ (strong bias). Coverage distributions are normalized by their mean, which equals the coverage depth, set to 15× in this plot. (c) Dropout rates as a function of coverage depth across the three bias regimes, showing the expected fraction of reference sequences receiving zero reads. (d)–(g) Minimum coverage depth and associated read cost achieved by the three best-performing codecs across the different bias and error combinations. (h) Average decoding times for the moderate bias and 5% error rate scenario, measured at the minimum coverage depth required for reliable decoding. (i) Trade-off between write cost and read cost for the moderate bias and 5% error rate scenario. (j) Reliable decoding region of the DNA-MGC+ codec (low-rate configuration) for the moderate-bias case, highlighting the error rate and coverage depth values for which reliable decoding is achievable under different clustering–alignment combinations.
Figure 2: In silico performance of DNA-MGC+ and comparison codecs under experimentally derived error and bias profiles. (a) Decoding success rate as a function of sequencing depth at a fixed physical redundancy of 100× under the DT4DDS low-fidelity workflow. Solid curves correspond to logistic regression fits of the empirical decoding outcomes. (b) Minimum sequencing depth required for reliable decoding at 100× physical redundancy, together with the corresponding read cost. (c) Maximum achievable storage density, expressed in exabytes per gram of DNA, at a fixed sequencing depth of 30×, computed from the minimum physical redundancy required for reliable decoding.
Figure 3: In vitro performance of DNA-MGC+ and comparison codecs under Illumina and Oxford Nanopore sequencing. (a) Minimum sequencing depth and corresponding read cost required for reliable decoding, obtained via progressive read downsampling, under Illumina and Nanopore sequencing. For Nanopore data, results are reported for multiple Dorado basecalling models with varying computational complexity. (b) Average decoding time required to recover the 24-KB stored file, measured at the minimum sequencing depth needed for reliable decoding for each codec configuration.
Figure 4: Schematic illustration of the DNA-MGC+ encoding process. (1) The input data is partitioned into $K$ fragments, each of length $k$ bits. (2) An outer Reed-Solomon code is applied across fragments to introduce inter-sequence redundancy, where each code symbol consists of $\ell_{\text{out}}$ bits, producing $c_{\text{out}}$ additional sequences. (3) Each sequence is prepended with a unique binary index of length $\ell_{\text{out}}$ bits. (4.1) In the first stage of the inner MGC+ code, the indexed sequences are encoded to introduce binary intra-sequence redundancy, using symbols of $\ell_{\text{in}}$ bits and generating $c_{\text{in}}$ guess parities in addition to a single check parity. (4.2) In the second stage of the inner MGC+ code, the resulting binary sequences are mapped to quaternary DNA sequences, followed by the insertion of periodic "$\mathsf{AC}$" markers and further barcoding of the check parity.
Figure A: Trellis representation of the drift sequence $\mathbf{z} = (z_0, z_1, \dots, z_v)$
...and 4 more figures
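
The bias model in Figure 1(b) and the dropout rates in Figure 1(c) can be illustrated with a short Monte Carlo sketch. This is not the authors' simulator. It assumes that $\sigma$ is the log-scale parameter of the lognormal distribution, and that the number of reads per reference sequence is Poisson with mean equal to the coverage depth times the lognormal factor; neither assumption is stated in the captions. The function names `lognormal_unit_mean` and `expected_dropout_rate` are hypothetical.

```python
import math
import random


def lognormal_unit_mean(sigma, rng):
    """Draw a relative coverage factor from a lognormal with mean 1.

    Assumption: sigma is the standard deviation of the underlying normal.
    """
    if sigma == 0:
        return 1.0  # no bias: every sequence gets the same expected coverage
    mu = -0.5 * sigma ** 2  # gives E[X] = exp(mu + sigma^2 / 2) = 1
    return rng.lognormvariate(mu, sigma)


def expected_dropout_rate(depth, sigma, n_sequences=100_000, seed=0):
    """Estimate the expected fraction of sequences that receive zero reads.

    Assumption: reads per sequence ~ Poisson(depth * factor), so the
    probability of zero reads for one sequence is exp(-depth * factor).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sequences):
        factor = lognormal_unit_mean(sigma, rng)
        total += math.exp(-depth * factor)
    return total / n_sequences


if __name__ == "__main__":
    # The three bias regimes named in Figure 1(b), at a few coverage depths.
    for sigma in (0.0, 0.5, 1.0):
        for depth in (5, 15, 30):
            rate = expected_dropout_rate(depth, sigma)
            print(f"sigma={sigma:<4} depth={depth:>2}x dropout={rate:.3e}")
```

Under these assumptions, a larger $\sigma$ raises the dropout rate at a fixed coverage depth, because a wider spread places more sequences at very low relative coverage.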
