Trellis BMA: Coded Trace Reconstruction on IDS Channels for DNA Storage

Sundara Rajan Srinivasavaradhan, Sivakanth Gopi, Henry D. Pfister, Sergey Yekhanin

TL;DR

This work tackles coded trace reconstruction for DNA storage by modeling the read process as an insertion-deletion-substitution (IDS) channel and introducing a low-complexity reconstruction algorithm, Trellis BMA. The method couples per-trace BCJR decoders in a consensus framework, using a specially constructed multi-trace IDS trellis to keep inference tractable while achieving near-optimal posterior marginals. Key contributions include a new multi-trace IDS trellis with fewer edges, a BCJR-based consensus decoding scheme with initialization, decoding, and half-estimation steps, and a publicly released dataset for benchmarking. The results show significant error-rate reductions on both simulated and real nanopore data. Inner marker-repeat (MR) codes perform strongly at high rates, and the approach lowers the practical decoding complexity of coded trace reconstruction in DNA storage.

Abstract

Sequencing a DNA strand, as part of the read process in DNA storage, produces multiple noisy copies which can be combined to produce better estimates of the original strand; this is called trace reconstruction. One can reduce the error rate further by introducing redundancy in the write sequence and this is called coded trace reconstruction. In this paper, we model the DNA storage channel as an insertion-deletion-substitution (IDS) channel and design both encoding schemes and low-complexity decoding algorithms for coded trace reconstruction. We introduce Trellis BMA, a new reconstruction algorithm whose complexity is linear in the number of traces, and compare its performance to previous algorithms. Our results show that it reduces the error rate on both simulated and experimental data. The performance comparisons in this paper are based on a new dataset of traces that will be publicly released with the paper. Our hope is that this dataset will enable research progress by allowing objective comparisons between candidate algorithms.
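
To make the channel model in the abstract concrete, the following is a minimal sketch of a memoryless IDS channel that turns one strand into several noisy traces. It is an illustration under stated assumptions, not the channel model used in the paper: the function names (ids_channel, make_traces), the default probabilities, and the choice of uniform random insertions and substitutions are all hypothetical.

```python
import random

ALPHABET = "ACGT"


def ids_channel(strand, p_ins=0.02, p_del=0.02, p_sub=0.02, rng=None):
    """Pass one strand through a simple memoryless IDS channel (assumed model).

    Before each input base, uniformly random bases are inserted with
    probability p_ins each (a geometric number of insertions). The base
    itself is then deleted with probability p_del, substituted by a
    different base with probability p_sub, or copied unchanged otherwise.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in strand:
        # Insertions: keep inserting while the coin flip succeeds.
        while rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base never reaches the output
        if r < p_del + p_sub:
            # Substitution: pick one of the three other bases.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)  # correct transmission
    return "".join(out)


def make_traces(strand, num_traces, seed=0, **channel_params):
    """Generate num_traces independent noisy copies of the same strand."""
    rng = random.Random(seed)
    return [ids_channel(strand, rng=rng, **channel_params)
            for _ in range(num_traces)]


if __name__ == "__main__":
    for trace in make_traces("ACGTACGTACGTACGT", num_traces=3, seed=1):
        print(trace)
```

A trace reconstruction algorithm would take the list returned by make_traces as input and estimate the original strand. In the coded setting, the strand would first be encoded with an inner code before being passed through the channel.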

Paper Structure

This paper contains 28 sections, 20 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) The interplay between inner and outer code in a DNA storage system. Data strands are first encoded using an outer code (to correct for missing sequences) and then using an inner code which corrects IDS errors. (b) The inner code architecture for DNA storage. Encoded DNA strands are read or "sequenced" using a sequencing technology, such as Illumina/Nanopore sequencers, and this outputs many noisy copies of the DNA sequence, from which the message vector in the data strand is recovered.
  • Figure 2: Experimental results on real data. Note that Subfigures \ref{fig:real_6coded_IRs} and \ref{fig:real_10coded_IRs} include the rate loss of their MR codes.
  • Figure 3: Experimental results on simulated data. Note that Subfigures \ref{fig:sim_6coded_IRs} and \ref{fig:sim_10coded_IRs} include the rate loss of their MR codes. These results are based on simulated data and are not affected by the issue discussed in “Note added on 8/12/2024” in Section III.
  • Figure 4: Hamming error rate for convolutional codes (CC) and marker repeat (MR) codes evaluated using real data for different coding rates with 2 traces.
  • Figure 5: AIRs for convolutional codes (CC) and marker repeat (MR) codes evaluated using real data for different coding rates with 2 traces. The rate loss of the inner code is included in these AIRs.
  • ...and 1 more figure