Optimal Overlap Detection of Shotgun Reads
Nir Luria, Nir Weinberger
TL;DR
The paper addresses the fundamental problem of detecting the overlap between two short reads drawn from a long sequence, where each read has length $\ell=\beta\log n$. It develops a Bayesian MAP detector and derives its exact asymptotic error probability in two regimes: noiseless reads from a stationary ergodic source, and noisy reads obtained through a memoryless channel, with the source additionally assumed to be memoryless. The error probability scales as $P_{\text{error}}^{*} \sim 2\big[1+o_{n}(1)\big]\cdot\big(\beta\wedge{1/{\cal H}_{1}(\mathbf{X})}\big)\cdot\frac{\log n}{n}$ in the noiseless case and $P_{\text{error}}^{*} \sim 2\big[1+o_{n}(1)\big]\cdot\big(\beta\wedge{1/{I(Y;\tilde{Y})}}\big)\cdot\frac{\log n}{n}$ in the noisy case, linking detectability to the Shannon entropy rate and to the reads’ mutual information, respectively. Because of the minimum $\wedge$, the leading-order term grows linearly in $\beta$ only up to $\beta = 1/{\cal H}_{1}(\mathbf{X})$ (respectively $\beta = 1/I(Y;\tilde{Y})$) and is constant in $\beta$ beyond that point. This quantifies the trade-off between read length, source statistics, and noise in determining overlap detectability. The results provide an information-theoretic foundation for overlap detection, relevant to alignment and sketching-based methods in genomics and related signal-processing tasks.
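To get a feel for the magnitudes involved, the two leading-order expressions above can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function name `leading_error_term` is hypothetical, the $1+o_{n}(1)$ factor is dropped, and it assumes natural logarithms with the information quantity (${\cal H}_{1}(\mathbf{X})$ or $I(Y;\tilde{Y})$) measured in nats, since the base of the logarithm is not stated here.

```python
import math


def leading_error_term(n: int, beta: float, info_rate: float) -> float:
    """Evaluate 2 * min(beta, 1 / info_rate) * log(n) / n.

    Assumptions (not from the source): natural logarithm, and `info_rate`
    (entropy rate or mutual information) given in nats. The 1 + o(1)
    factor of the asymptotic formula is ignored.
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    if beta <= 0 or info_rate <= 0:
        raise ValueError("beta and info_rate must be positive")
    return 2.0 * min(beta, 1.0 / info_rate) * math.log(n) / n


if __name__ == "__main__":
    n = 10**6
    # Illustrative source: i.i.d. uniform symbols over a 4-letter alphabet,
    # whose entropy rate is log(4) nats per symbol.
    h = math.log(4)
    for beta in (0.25, 0.5, 1.0, 2.0):
        term = leading_error_term(n, beta, h)
        print(f"beta={beta:<5} leading term = {term:.3e}")
```

For the illustrative four-letter uniform source, the printed values stop changing once $\beta$ exceeds $1/\log 4 \approx 0.72$, which is the saturation implied by the minimum in the formulas.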
Abstract
We consider the problem of detecting the overlap between a pair of short fragments sampled at random locations from an exponentially longer sequence, via their possibly noisy reads. We first consider a noiseless setting, in which the fragments are read without error and the sequence is only assumed to be stationary and ergodic. Under mild conditions on the mixing property of the process generating the sequence, we characterize exactly the asymptotic error probability of the optimal Bayesian detector. We then consider a noisy setting, in which the reads are noisy versions of the sampled fragments obtained via a memoryless channel. In this setting we further assume that the sequence is stationary and memoryless, and again characterize exactly the asymptotic error probability of the optimal Bayesian detector.
