How to Find Long Maximal Exact Matches and Ignore Short Ones

Travis Gagie

TL;DR

The paper tackles the problem of efficiently locating long maximal exact matches (MEMs) between a pattern $P$ and a text $T$ in large, repetitive, pangenomic data, while ignoring the many short MEMs. It introduces a filtered MEM-finding approach that modifies Li's forward-backward MEM search to report only MEMs of length at least $L$, using LCP/LCS queries and a compact index. The paper shows that, given a compact index for a text whose BWT and reversed-text BWT have $r$ and $\bar{r}$ runs, together with a straight-line program with $g$ rules, all MEMs of length at least $L$ can be reported in time $O(m \log \sigma + \mu_{(1-\epsilon)L}\log n)$ (or $O(m + \mu_{(1-\epsilon)L}\log n)$ for polylogarithmic alphabets), and, with lazy evaluation, in $O\left(\left(\frac{m}{\epsilon L} + \mu_{(1-\epsilon)L}\right) t(n)\right)$, where $t(n)$ is the cost of a single LCP/LCS query. The approach unifies and improves on prior methods (Li12 and Goga et al.) by concentrating computation on the long, informative MEMs, which makes it suitable for pangenome-scale references. This supports memory-conscious MEM discovery for read mapping and genome analysis in highly repetitive datasets.

Abstract

Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least $L$ between a pattern of length $m$ and a text of length $n$ in $O (m)$ time plus extra $O (\log n)$ time only for each MEM of length at least nearly $L$ using a compact index for the text, suitable for pangenomics.
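To make the problem statement concrete, the following is a minimal brute-force sketch of what "all MEMs of length at least $L$" means. It is not the paper's algorithm and uses no index: it simply checks substrings of the text directly, so it is only suitable for toy inputs. The function name `long_mems` and the 0-based `(start, length)` output are assumptions made for illustration.

```python
def long_mems(P, T, L):
    """Return (start, length) pairs, 0-based in P, for every MEM of P
    with respect to T whose length is at least L (naive, no index)."""
    m = len(P)
    # ell[i] = length of the longest prefix of P[i:] that occurs in T.
    ell = []
    for i in range(m):
        k = 0
        while i + k < m and P[i:i + k + 1] in T:
            k += 1
        ell.append(k)
    mems = []
    for i in range(m):
        # P[i:i+ell[i]] cannot be extended to the right by construction.
        # It can be extended to the left exactly when the match starting
        # at i-1 is one character longer, i.e. ell[i-1] == ell[i] + 1.
        left_maximal = (i == 0) or (ell[i - 1] < ell[i] + 1)
        if ell[i] >= L and left_maximal:
            mems.append((i, ell[i]))
    return mems


if __name__ == "__main__":
    P = "TACATAGATTAG"
    T = "GATTAGATACAT"
    for start, length in long_mems(P, T, 4):
        print(start, length, P[start:start + length])
```

Lowering the threshold to `L = 1` in this toy example also reports the short MEM `ATA`, which is the kind of match the filtered search is meant to skip.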

Paper Structure

This paper contains 5 sections, 4 theorems, 6 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

There is an $O (r + \bar{r})$-space index for $T$, where $r$ and $\bar{r}$ are the number of runs in the BWT of $T$ and the reverse of $T$, with which when given $P$ we can compute $\mathrm{MF}$ and $\mathrm{MB}$ in $O (m \log \sigma)$ time, or $O (m)$ time when $\sigma$ is polylogarithmic in $n$.
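As an illustration of what $\mathrm{MF}$ and $\mathrm{MB}$ contain, here is a naive sketch that computes them by direct comparison rather than with the $O(r + \bar{r})$-space index of the theorem. It assumes the reading suggested by the Figure 2 caption: $\mathrm{MF}[i]$ is a starting position in $T$ of a suffix with the longest common prefix with $P[i..m]$, and $\mathrm{MB}[i]$ is an ending position in $T$ of a prefix with the longest common suffix with $P[1..i]$. The function names and the tie-breaking rule (smallest position) are assumptions of this sketch.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def match_pointers(P, T):
    """Return (MF, MB) as 1-based lists (index 0 is unused and set to None).

    Naive time roughly O(m * n * min(m, n)); for illustration only.
    """
    m, n = len(P), len(T)
    MF = [None] * (m + 1)
    MB = [None] * (m + 1)
    for i in range(1, m + 1):
        suffix_of_P = P[i - 1:]          # P[i..m]
        rev_prefix_of_P = P[:i][::-1]    # P[1..i] reversed
        # Start j of the suffix T[j..n] with the longest common prefix.
        MF[i] = max(range(1, n + 1),
                    key=lambda j: lcp(T[j - 1:], suffix_of_P))
        # End j of the prefix T[1..j] with the longest common suffix,
        # computed as a common prefix of the reversed strings.
        MB[i] = max(range(1, n + 1),
                    key=lambda j: lcp(T[:j][::-1], rev_prefix_of_P))
    return MF, MB


if __name__ == "__main__":
    MF, MB = match_pointers("TACATAGATTAG", "GATTAGATACAT")
    print("MF[6] =", MF[6])
    print("MB[5] =", MB[5])
```

On the strings from Figure 2 this reproduces the two pointers named in the caption, $\mathrm{MF}[6] = 5$ and $\mathrm{MB}[5] = 12$.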

Figures (4)

  • Figure 1: A randomly chosen string (top) over $\{\mathtt{A}, \mathtt{C}, \mathtt{G}, \mathtt{T}\}$ with the highlighted substring copied (center) and then edited. The differences from the original substring are shown highlighted in red in the copy, with the lengths of the MEMs of the copy with respect to the whole string shown under the copy; 12 is shown as (12) to distinguish it from 1 followed by 2. The occurrences of the MEMs in the whole string (bottom) are shown in black when they have lengths 4, 5 or 6, and in red when they have lengths 8 or 12. Substrings longer than 6 characters shown in black are formed by consecutive or overlapping occurrences of MEMs of length at most 6.
  • Figure 2: The forward-match and backward-match pointers $\mathrm{MF} [1..m]$ and $\mathrm{MB} [1..m]$ of $P = \mathtt{TACATAGATTAG}$ with respect to $T = \mathtt{GATTAGATACAT}$. Since $T [5..12]$ has the longest common prefix AGAT with $P [6..12]$, $\mathrm{MF} [6] = 5$ (red); since $T [1..12]$ has the longest common suffix TACAT with $P [1..5]$, $\mathrm{MB} [5] = 12$ (blue).
  • Figure 3: A trace (top) of how Algorithm \ref{alg:BF} processes our example (bottom) of $P = \mathtt{TACATAGATTAG}$ and $T = \mathtt{GATTAGATACAT}$ from Figure 2, with $L = 4$.
  • Figure 4: The two subcases of the second case ($b < L$ in line 4 of Algorithm \ref{alg:BF}). When $(1 - \epsilon) L \leq b < L$ (top), there is a MEM of length at least $L$ starting at $i_k + L - b$ (shown in grey). When $b < (1 - \epsilon) L$ (bottom), there are $L - b > \epsilon L$ characters between $i$ and $i + L - b$ (shown in grey).

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4