Optimal Overlap Detection of Shotgun Reads

Nir Luria, Nir Weinberger

TL;DR

The paper addresses the problem of detecting the overlap between two short reads drawn from a long sequence of length $n$, with the read length modeled as $\ell=\beta\log n$. It develops the Bayesian MAP detector and derives its exact asymptotic error probability in two regimes: noiseless reads from a stationary ergodic source, and noisy reads obtained through a memoryless channel, where the source is additionally assumed to be memoryless. The error probability scales as $P_{\text{error}}^{*} \sim 2\big[1+o_{n}(1)\big]\cdot\big(\beta\wedge{1/{\cal H}_{1}(\mathbf{X})}\big)\cdot\frac{\log n}{n}$ in the noiseless case and $P_{\text{error}}^{*} \sim 2\big[1+o_{n}(1)\big]\cdot\big(\beta\wedge{1/{I(Y;\tilde{Y})}}\big)\cdot\frac{\log n}{n}$ in the noisy case, where $a\wedge b$ denotes the minimum of $a$ and $b$. Detectability is thus governed by the Shannon entropy rate of the source in the first case and by the mutual information between the reads in the second. These results quantify the trade-off between read length, process statistics, and noise in overlap detection, and give an information-theoretic baseline for practical alignment and sketching-based methods in genomics and related signal-processing tasks.
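
As a rough numerical illustration of the noiseless-case scaling above, the following sketch evaluates only the leading term $2\,(\beta\wedge 1/{\cal H})\,\frac{\log n}{n}$ and drops the $1+o_{n}(1)$ factor. It assumes natural logarithms with the entropy rate measured in nats (the summary does not state the base), and the function names `memoryless_entropy_rate` and `leading_error_term` are hypothetical, not taken from the paper.

```python
import math


def memoryless_entropy_rate(probs):
    """Shannon entropy of one symbol, in nats.

    For a memoryless (i.i.d.) source this equals the entropy rate.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)


def leading_error_term(n, beta, rate):
    """Leading-order term 2 * min(beta, 1/rate) * log(n) / n.

    Assumption: log is the natural log and `rate` is in nats, so that
    the units of `beta` and `1/rate` agree. The 1 + o(1) factor of the
    asymptotic result is ignored.
    """
    if n <= 1 or beta <= 0 or rate <= 0:
        raise ValueError("need n > 1, beta > 0 and rate > 0")
    return 2.0 * min(beta, 1.0 / rate) * math.log(n) / n


if __name__ == "__main__":
    # Uniform source over a 4-letter alphabet (e.g. A, C, G, T).
    h = memoryless_entropy_rate([0.25] * 4)
    for beta in (0.5, 2.0):
        print(beta, leading_error_term(10**6, beta, h))
```

In this sketch, increasing `beta` beyond `1 / rate` no longer changes the returned value, which mirrors the minimum in the formula.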

Abstract

We consider the problem of detecting the overlap between a pair of short fragments sampled in random locations from an exponentially longer sequence, via their possibly noisy reads. We consider a noiseless setting, in which the reads are noiseless, and the sequence is only assumed to be stationary and ergodic. Under mild conditions on the mixing property of the process generating the sequence, we characterize exactly the asymptotic error probability of the optimal Bayesian detector. Similarly, we consider a noisy setting, in which the reads are noisy versions of the sampled fragments obtained via a memoryless channel. We further assume that the sequence is stationary and memoryless, and similarly characterize exactly the asymptotic error probability of the optimal Bayesian detector for this case.

Paper Structure

This paper contains 22 sections, 18 theorems, 234 equations, 3 figures.

Key Result

Lemma 3

Consider the noiseless setting (Definition 1). Let […] and let […]. Then, the likelihood of overlap $t\in[\ell]$ given reads $x_{1}^{\ell}(1)$ and $x_{1}^{\ell}(2)$ is […], and for $t\in-[\ell-1]$ it is […], where we set $\Gamma_{\pm}(0)=1$.

Figures (3)

  • Figure 1: Reads of length $\ell=5$. $X(1)$ and $X(2)$ have a positive overlap of $T=2$, whereas $\tilde{X}(1)$ and $X(2)$ have a negative overlap $T=-2$.
  • Figure 2: A case in which $X(1)$ and $X(2)$ are consistent with overlaps of both length $\tilde{t}=2$ and length $\overline{t}=3$, for $\ell=5$. We show $X(1)$ and the two possible alignments of $X(2)$ to $X(1)$. Relating the two alignments yields relations between the symbols of $X(1)$. For example, the overlap of length $\tilde{t}$ (top) implies that the sub-sequence marked in red (square) in $X(1)$ equals the sub-sequence marked in blue (x) in $X(2)$. The overlap of length $\overline{t}$ (bottom) implies that the sub-sequence marked in grey (disk) in $X(1)$ also equals the sub-sequence marked in blue (x) in $X(2)$. Together, these imply that the sub-sequence marked in red (square) in $X(1)$ equals the sub-sequence marked in grey (disk) in $X(1)$. By similar arguments, it follows that $X_{I(1)+2}^{I(1)+4}=X_{I(1)+3}^{I(1)+5}$, a repetition of length $\overline{t}=3$ (here it is "tail-biting", but this is not the case for larger differences $\overline{t}-\tilde{t}$).
  • Figure 3: Time indices of the pair matching in the modified event (\ref{eq: pair matching at distant tau modified}). Here $m=4$, and the symbols $X_{1+4\tau}$ and $X_{1+5\tau}$ are less than $\tau$ time points from $X_{1+s}$, so their corresponding events are removed from the event in (\ref{eq: pair matching at distant tau}) (shown as dashed points in the figure). Symbols whose time indices are less than $\tau$ time points apart are marked by circles of a similar color.
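
The overlap notion illustrated in Figures 1 and 2 can be sketched for noiseless reads as follows. This is only a consistency check, not the MAP detector of the paper: it lists every overlap value $t$ that the two reads do not rule out. The sign convention is an assumption made for illustration: a positive $t\in[\ell]$ means the last $t$ symbols of the first read equal the first $t$ symbols of the second, and a negative $t\in-[\ell-1]$ means the same with the roles of the reads swapped. The function name `consistent_overlaps` is hypothetical.

```python
def consistent_overlaps(x1, x2):
    """Return all overlap values t that two equal-length reads allow.

    Assumed convention (not confirmed by the source):
      t > 0: the last t symbols of x1 equal the first t symbols of x2
      t < 0: the last |t| symbols of x2 equal the first |t| symbols of x1
    """
    if len(x1) != len(x2):
        raise ValueError("reads must have the same length")
    ell = len(x1)
    hits = []
    # Positive overlaps: t ranges over 1..ell (t = ell is full overlap).
    for t in range(1, ell + 1):
        if x1[ell - t:] == x2[:t]:
            hits.append(t)
    # Negative overlaps: |t| ranges over 1..ell-1.
    for t in range(1, ell):
        if x2[ell - t:] == x1[:t]:
            hits.append(-t)
    return sorted(hits)


if __name__ == "__main__":
    # A single consistent overlap of length 2.
    print(consistent_overlaps("ACGTA", "TAGGC"))
    # Several consistent overlaps, caused by a repetition inside the reads.
    print(consistent_overlaps("CGAAA", "AAATG"))
```

The second call shows the ambiguity described in the caption of Figure 2: when more than one overlap length is consistent, the first read must contain a repeated pattern, so matching symbols alone cannot identify the true overlap.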

Theorems & Definitions (23)

  • Definition 1: Noiseless setting
  • Definition 2: Noisy setting
  • Lemma 3
  • Proposition 4
  • Definition 5: Strong mixing coefficient
  • Theorem 6
  • Example 7: Memoryless sources
  • Example 8: First-order Markov Process (Markov chains)
  • Theorem 9
  • Lemma 10
  • ...and 13 more
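
For sources such as the first-order Markov chains named in Example 8, the Shannon entropy rate has the standard closed form ${\cal H}=-\sum_{i}\pi_{i}\sum_{j}P_{ij}\log P_{ij}$, where $\pi$ is the stationary distribution of the transition matrix $P$. The sketch below computes it in nats. It assumes the chain is irreducible and aperiodic, so that plain power iteration converges. The function names are hypothetical, and the paper's own treatment of this example is not reproduced here.

```python
import math


def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Uses power iteration from the uniform distribution; assumes the chain
    is irreducible and aperiodic so the iteration converges.
    """
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def markov_entropy_rate(P):
    """Entropy rate (nats) of a stationary first-order Markov chain."""
    pi = stationary_distribution(P)
    k = len(P)
    return -sum(
        pi[i] * P[i][j] * math.log(P[i][j])
        for i in range(k)
        for j in range(k)
        if P[i][j] > 0
    )


if __name__ == "__main__":
    # Binary chain that flips state with probability 0.1.
    P = [[0.9, 0.1], [0.1, 0.9]]
    print(markov_entropy_rate(P))
```

A chain with strong memory has a low entropy rate, so under the noiseless-case scaling quoted in the TL;DR the factor $1/{\cal H}_{1}(\mathbf{X})$ is large and the read-length coefficient $\beta$ is more likely to be the binding term in the minimum.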