Incorporating indel channels into average-case analysis of seed-chain-extend

Spencer Gibson, Yun William Yu

TL;DR

This work extends average-case analyses of seed-chain-extend from substitution-only mutation models to channels that include insertions and deletions (indels). It introduces the homologous path and clipping anchors, along with a generalized recoverability framework, to handle the dependence between neighboring anchors and the partially correct anchors that indels induce. The authors prove that the expected recoverability of an optimal seed-chain-extend chain is at least 1 - O(1/√m) and that the expected runtime is O(m n^{Cα} log n) when the total mutation rate satisfies θ_T < 0.159, with 3.15·θ_T as a concrete value for the exponent Cα. Experimental results corroborate the theory: 1 - E(R) decays no slower than m^{-1/2}, and runtime scales consistently with the predicted bound. Together these results provide theoretical justification for using seed-chain-extend on real genomes with indels.

Abstract

Given a sequence $s_1$ of $n$ letters drawn i.i.d. from an alphabet of size $σ$ and a mutated substring $s_2$ of length $m < n$, we often want to recover the mutation history that generated $s_2$ from $s_1$. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with $k$-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with $1 - O\left(\frac{1}{\sqrt{m}}\right)$ recoverability, the proportion of the mutation history that is recovered, in $O\left(mn^{2.43θ} \log n\right)$ expected time, where $θ< 0.206$ is the mutation rate under a substitution-only channel and $s_1$ is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this paper, we generalize those prior results by introducing mathematical machinery to deal with the two new obstacles introduced by indel channels: the dependence of neighboring anchors and the presence of anchors that are only partially correct. We are thus able to prove that the expected recoverability of an optimal chain is $\ge 1 - O\Bigl(\frac{1}{\sqrt{m}}\Bigr)$ and the expected runtime is $O(mn^{3.15 \cdot θ_T}\log n)$, when the total mutation rate given by the sum of the substitution, insertion, and deletion mutation rates ($θ_T = θ_i + θ_d + θ_s$) is less than $0.159$.
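
The runtime bound in the abstract depends on the three mutation rates only through their sum $θ_T$. As a minimal sketch of that arithmetic, the snippet below computes the exponent $3.15 \cdot θ_T$ and rejects rates outside the stated range. The function name `runtime_exponent` and its parameter names are hypothetical; only the constants 3.15 and 0.159 come from the abstract.

```python
def runtime_exponent(theta_s, theta_i, theta_d, coeff=3.15, max_rate=0.159):
    """Exponent of n in the O(m n^{coeff * theta_T} log n) runtime bound.

    Hypothetical helper: coeff and max_rate are the constants quoted in
    the abstract; everything else is illustrative.
    """
    # Total mutation rate: substitution + insertion + deletion.
    theta_t = theta_s + theta_i + theta_d
    # The abstract states the bound only for theta_T < 0.159.
    if not 0 <= theta_t < max_rate:
        raise ValueError(f"theta_T = {theta_t} is outside [0, {max_rate})")
    return coeff * theta_t


if __name__ == "__main__":
    # Example rates chosen for illustration only (theta_T = 0.1).
    print(runtime_exponent(0.05, 0.025, 0.025))
```

With the illustrative rates above, $θ_T = 0.1$ and the exponent is about $0.315$, so the bound reads roughly $O(mn^{0.315}\log n)$.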

Paper Structure

This paper contains 18 sections, 34 theorems, 24 equations, 4 figures.

Key Result

Lemma

Let $t_0 = \frac{1}{2}\ln(\frac{9}{1 + 8\gamma})$ and $\ell = \frac{21k}{\beta}$. With probability $\ge 1 - \frac{2}{n}$, no $k$-mer in $S[p+1:p+m']$ has more than $\frac{1}{t_0}(\frac{2}{\beta} + 1)k$ inserted base pairs in $S'$ and no $\ell$-block in $S[p+1:p+m']$ contracts to size $\le \frac{(1 -

Figures (4)

  • Figure 1.2.1: (A-B) Generalized recoverability. (A) Homologous anchors lie entirely on the path, spurious anchors entirely off, and clipping anchors partially on/off. For the purposes of recoverability, we remove points that share an x- or y-coordinate with a clipped point. (B) Removing points corresponds to allowing alternate just-as-good paths where the clipping anchor is homologous. (C) The match graph resulting from the mutation process that gives $S'$ from $S = \text{TACTTCGC}$, including a deletion (red), insertion (green), substitutions (blue), and matches (clear). Horizontal lines represent corresponding positions between the sequences. Specifically, in $S$, an insertion of the letter T occurs at position $4$, position $5$ is deleted, and the characters at positions $6$, $7$, and $8$ are mutated.
  • Figure 1.4.1: Recoverability analyses across mutation parameters and corresponding runtime scaling behavior.
  • Figure 1.A.1: The points along the dashed blue line make up the homologous path given the edits turning $S = \text{TACTTCGC}$ into $S' = \text{TACTTTAC}$, following Figure 1.2.1C. In this example, anchors are matching seeds of length $3$. Anchor A (red dash) is a homologous anchor since it lies entirely on the path. Anchor B (green dash) is a clipping anchor since it lies only partially on the path; namely, the midpoint of the anchor does not belong to the homologous path. Anchor C (black dash) is spurious since it lies entirely off the path.
  • Figure 1.A.2: Induced match graph in the substitution-only regime of an initial string of length $4$ with anchors $A(1,3)$ and $A(3,1)$. These anchors violate Shaw and Yu's conditions for independence (shaw2023proving) and, as can be seen, there exists a cycle in the graph.
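
The caption of Figure 1.A.1 treats anchors as matching seeds of length $3$ between $S$ and $S'$. The sketch below enumerates such exact $k$-mer matches for the two strings in that caption. It is an illustration only, not the authors' implementation: the function name `kmer_anchors` is hypothetical, positions are 0-indexed start offsets rather than the figure's coordinates, and classifying each anchor as homologous, clipping, or spurious would additionally need the homologous path, which is not reconstructed here.

```python
from collections import defaultdict


def kmer_anchors(s, s_prime, k):
    """Return sorted (i, j) pairs where s[i:i+k] == s_prime[j:j+k].

    Hypothetical helper; i and j are 0-indexed start positions.
    """
    # Index every k-mer of s by its start positions.
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)

    # Look up each k-mer of s_prime and record every exact match.
    anchors = []
    for j in range(len(s_prime) - k + 1):
        for i in index.get(s_prime[j:j + k], []):
            anchors.append((i, j))
    return sorted(anchors)


if __name__ == "__main__":
    # Strings taken from the caption of Figure 1.A.1.
    print(kmer_anchors("TACTTCGC", "TACTTTAC", 3))
    # → [(0, 0), (0, 5), (1, 1), (2, 2)]
```

The seed TAC occurs twice in $S'$, so it produces two anchors from the same position in $S$; a chaining step would then have to choose between them.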

Theorems & Definitions (79)

  • Definition: Mutation model
  • Remark
  • Definition: Homologous path (Inspired by Ganesh and Sy)
  • Definition: Key definitions and assumptions
  • Definition: Defining non-recoverable regions
  • Definition: Generalized recoverability
  • Definition: Defining anchor types
  • Lemma: Defining the Expansion-Contraction ($\bm{EC}$) space
  • Proof
  • Lemma: Working in $\bm{EC}$
  • ...and 69 more