SynDe: Syndrome-guided Decoding of Raw Nanopore Reads

Anisha Banerjee, Roman Sokolovskii, Thomas Heinis, Antonia Wachter-Zeh, Eirik Rosnes, Alexandre Graell i Amat

Abstract

Nanopore sequencing technology remains highly error-prone, making efficient error correction essential in DNA-based data storage. Prior work addressed high error rates using convolutional codes with their decoder coupled with the basecaller, but such approaches only accommodate a limited number of code classes and incur significant decoding complexity. To overcome these limitations, we propose two algorithms: PrimerSeeker, which efficiently detects primer sequences in raw nanopore sequencing reads, and SynDe, a decoder that operates on the same raw reads and supports any linear error correction code with a low-complexity graphical representation. PrimerSeeker provides primer location estimates close to those of existing approaches while being better suited for real-time primer detection during sequencing. SynDe performs well with convolutional codes augmented with periodic markers, often approaching or exceeding the performance of existing algorithms with a lower time complexity. Remarkably, the confidence scores produced by SynDe reliably identify which of its outputs should be discarded.

Paper Structure

This paper contains 20 sections, 4 equations, 11 figures, 4 tables, 3 algorithms.

Figures (11)

  • Figure 1: Overview of our decoding workflow. A raw signal is first processed by a neural network that generates a probability matrix over all possible states (for instance $k$-mers or CTC tokens). This matrix is fed to our PrimerSeeker algorithm, which estimates the position at which the primer sequence starts in the raw read, say $n_{\textnormal{P}}$. The columns preceding the $n_{\textnormal{P}}$-th column are cropped and the trimmed probability matrix is passed to SynDe, which leverages the syndrome trellis of the adopted convolutional codes and incorporates knowledge of any included marker symbols to perform "code-aware" basecalling, ultimately generating the decoded DNA sequence.
  • Figure 2: Illustration of PrimerSeeker-Lokatt. This example considers $\textrm{C}\textrm{A}\textrm{C}\textrm{G}\textrm{T}\textrm{A}\textrm{G}\textrm{G}$ as the primer and $k=2$. A neural network takes the raw signal as input and returns a probability matrix, say $\boldsymbol{P}$, where the $(i,j)$-th entry denotes the probability that the $i$-th sample resulted from the presence of the $j$-th $k$-mer in the pore, for all possible $k$-mers. This matrix is used by the primer search algorithm, which, in this example, considers two candidate starting positions in the raw read, $a$ and $b$. Starting with two initial beams, one for each candidate, the algorithm extends each by the first $k$-mer $\textrm{C}\textrm{A}$ and scores them according to $\boldsymbol{P}$. Each beam is subsequently propagated through the upcoming samples through either (i) a dwell event, where the current $k$-mer remains in the pore for an additional sample; or (ii) an extension event, where the next nucleotide translocates into the pore, thus transitioning to the next $k$-mer state. At each iteration, the algorithm propagates every beam into two child beams, assigns them likelihood scores using $\boldsymbol{P}$, and merges any beams that share identical starting and ending positions and represent the same sequence of $k$-mers. Whenever a beam has traversed all $k$-mers corresponding to the target primer sequence, its final score is considered to be the probability that the primer sequence began at its starting position. The algorithm outputs the start position associated with the beam having the highest score.
  • Figure 3: Performance of the primer search algorithm on the dataset from Volkel et al. [volkelNominalFAST5FASTQ2024]. Each of the curves A and B compares two methods of locating a target sequence in raw reads. For each positive integer $\Delta$ (x-axis), the y-axis shows the fraction of raw reads for which the position estimates from the two competing methods lie within $\Delta$ samples of each other. A: Comparison between PrimerSeeker-CTC and estimates obtained by the method used by BeamTrellis. B: Comparison between PrimerSeeker-Lokatt and estimates obtained using the Guppy basecaller. The Guppy basecaller generates an erroneous sequence along with the raw signal positions of the corresponding base transitions. The starting position of the primer is taken to be the raw read sample that corresponds to the transition into the first base of the portion of the basecalled sequence that matches the primer sequence most closely.
  • Figure 4: Comparison of the decoding performance of SynDe-CTC with BeamTrellis [chandakOvercomingHighNanopore2020] on the dataset from Volkel et al. [volkelNominalFAST5FASTQ2024]. The x-axis values indicate the percentage of raw reads that were discarded, while the corresponding y-axis values indicate the FER, i.e., the fraction of the remaining reads that were decoded incorrectly. We observe that for the same fraction of discarded reads, SynDe-CTC achieves a better FER than BeamTrellis despite using codes of slightly higher rates. In each code identifier CC(M)x-y, x denotes the memory of the code. More details on the codes can be found in Supplementary \ref{supp:encoding} (Table \ref{tab:codes1} and Table \ref{tab:codes2}).
  • Figure 5: Syndrome trellis of a binary convolutional code of length $N=10$ with parity-check matrix given by Eq. (\ref{eq::parity}). Every codeword of this code, say ${\boldsymbol c}$, corresponds to a specific path from the starting node to the terminating node. If the $i$-th edge of the path traced by ${\boldsymbol c}$ is dashed, $c_i=0$, and $c_i=1$ otherwise. Each node represents a specific syndrome state, the (compressed) decimal representation of which is indicated by the node's label. For instance, the path marked in green corresponds to the codeword ${\boldsymbol c}=0011111100$. Note that we draw the binary trellis here for illustrative purposes; the quaternary trellis can be straightforwardly obtained by merging every two consecutive trellis sections. In such a quaternary trellis, we may apply the mapping $00\xrightarrow[]{} \textrm{A}$, $11\xrightarrow[]{} \textrm{T}$ to transform the codeword ${\boldsymbol c}$ into the DNA codeword $\textrm{A}\textrm{T}\textrm{T}\textrm{T}\textrm{A}$.
  • ...and 6 more figures
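
The dwell/extension search described in the caption of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmer_index` and `primer_start`, the lexicographic column ordering of the $k$-mers, the optional `max_len` window, and the choice to add the scores of merged beams are all hypothetical. The sketch also assumes the primer is longer than $k$ and uses no transition probabilities beyond the entries of $\boldsymbol{P}$.

```python
from itertools import product

import numpy as np


def kmer_index(k):
    """Map each k-mer over ACGT to a column of P (lexicographic order is an assumption)."""
    return {"".join(p): j for j, p in enumerate(product("ACGT", repeat=k))}


def primer_start(P, primer, k, candidates, max_len=None):
    """Return (start, score) for the candidate start with the best primer score.

    P[i, j] is the probability that sample i was produced by k-mer j.
    Beams with the same start, the same end and the same k-mer sequence are
    merged by adding their scores (an assumption of this sketch).
    """
    index = kmer_index(k)
    # Columns of P visited, in order, while the primer passes through the pore.
    states = [index[primer[t:t + k]] for t in range(len(primer) - k + 1)]
    last = len(states) - 1
    best_start, best_score = None, 0.0
    for start in candidates:
        stop = P.shape[0] if max_len is None else min(P.shape[0], start + max_len)
        # score[s]: merged score of beams that currently sit in primer k-mer s.
        score = np.zeros(len(states))
        score[0] = P[start, states[0]]
        for i in range(start + 1, stop):
            dwell = score.copy()      # current k-mer stays for one more sample
            dwell[last] = 0.0         # a beam stops once it reaches the final k-mer
            extend = np.concatenate(([0.0], score[:-1]))  # move to the next k-mer
            score = (dwell + extend) * P[i, states]
            if score[last] > best_score:
                best_start, best_score = start, score[last]
    return best_start, best_score
```

Because merged beams share their start, end and $k$-mer sequence, keeping one score per primer $k$-mer for each candidate start is enough, so the work per candidate grows with the number of samples times the number of primer $k$-mers. Multiplying raw probabilities favours short alignments, so a practical version would likely work in the log domain and normalise for length.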