Incorporating indel channels into average-case analysis of seed-chain-extend
Spencer Gibson, Yun William Yu
TL;DR
This work extends average-case analyses of seed-chain-extend from substitution-only mutation models to channels that include insertions and deletions (indels). It introduces the homologous path and clipping anchors, along with a generalized recoverability framework, to capture the dependence between anchors and the partially correct anchors that indels induce. The authors prove that the expected recoverability of an optimal seed-chain-extend chain is at least 1 - O(1/√m) and that the expected runtime is O(m n^{Cα} log n) when the total mutation rate satisfies θ_T < 0.159, where the exponent Cα is bounded concretely by 3.15·θ_T. Experimental results corroborate the theory: 1 - E(R) decays no slower than m^{-1/2}, and the runtime scales consistently with the predicted bound. Together, these results provide theoretical justification for using seed-chain-extend on real genomes with indels.
Abstract
Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m < n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\left(\frac{1}{\sqrt{m}}\right)$ recoverability (the proportion of the mutation history that is recovered) in $O\left(mn^{2.43θ} \log n\right)$ expected time, where $θ < 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice: real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles that indel channels introduce: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\ge 1 - O\left(\frac{1}{\sqrt{m}}\right)$ and the expected runtime is $O\left(mn^{3.15 \cdot θ_T}\log n\right)$ when the total mutation rate, given by the sum of the substitution, insertion, and deletion mutation rates ($θ_T = θ_i + θ_d + θ_s$), is less than $0.159$.
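
To make the arithmetic behind the stated bounds concrete, the following minimal Python sketch evaluates the runtime exponent $3.15 \cdot θ_T$ for a given split of substitution, insertion, and deletion rates. The function names and the example rates are hypothetical and chosen only for illustration. The last helper returns only the shape $m \cdot n^{3.15 θ_T} \cdot \log n$ with the hidden constant dropped, so it is not a runtime prediction.

```python
import math

# Coefficients and thresholds copied from the bounds stated in the abstract.
# All function names below are hypothetical, chosen for this illustration.
SUBSTITUTION_ONLY_COEFF = 2.43      # exponent coefficient, substitution-only channel
SUBSTITUTION_ONLY_MAX_RATE = 0.206  # requires theta < 0.206
INDEL_COEFF = 3.15                  # exponent coefficient, indel channel
INDEL_MAX_TOTAL_RATE = 0.159        # requires theta_T < 0.159


def total_mutation_rate(theta_s, theta_i, theta_d):
    """theta_T = theta_i + theta_d + theta_s."""
    return theta_s + theta_i + theta_d


def runtime_exponent(theta_t):
    """Exponent of n in the indel-channel bound, valid only for theta_T < 0.159."""
    if not 0.0 <= theta_t < INDEL_MAX_TOTAL_RATE:
        raise ValueError("the stated bound requires 0 <= theta_T < 0.159")
    return INDEL_COEFF * theta_t


def runtime_bound_shape(m, n, theta_t):
    """m * n^(3.15 * theta_T) * log n, ignoring the constant hidden by O(.)."""
    return m * n ** runtime_exponent(theta_t) * math.log(n)


if __name__ == "__main__":
    # Illustrative rates only; they are not taken from the paper's experiments.
    theta_t = total_mutation_rate(theta_s=0.05, theta_i=0.025, theta_d=0.025)
    print("theta_T:", theta_t)
    print("exponent of n:", runtime_exponent(theta_t))
    # Exponent evaluated at each channel's rate threshold.
    print("indel channel at threshold:", INDEL_COEFF * INDEL_MAX_TOTAL_RATE)
    print("substitution-only at threshold:",
          SUBSTITUTION_ONLY_COEFF * SUBSTITUTION_ONLY_MAX_RATE)
```

Evaluated at their respective thresholds, both exponents come out at approximately $0.5$ ($3.15 \cdot 0.159 \approx 0.5009$ and $2.43 \cdot 0.206 \approx 0.5006$), so in both settings the exponent of $n$ stays at roughly $1/2$ or below across the admissible range of mutation rates.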
