How to Find Long Maximal Exact Matches and Ignore Short Ones
Travis Gagie
TL;DR
The paper addresses the problem of efficiently finding long maximal exact matches (MEMs) between a pattern $P$ of length $m$ and a text $T$ of length $n$ in large, repetitive pangenomic data, while ignoring the far more numerous short MEMs. It modifies Li's forward-backward MEM search so that it reports only MEMs of length at least $L$, using LCP/LCS queries on a compact index. The paper shows that, given a compact index parameterized by the run counts $r$ and $\bar{r}$ and a straight-line program with $g$ rules, all MEMs of length at least $L$ can be reported in $O(m \log \sigma + \mu_{(1-\epsilon)L}\log n)$ time (or $O(m + \mu_{(1-\epsilon)L}\log n)$ time for polylogarithmic alphabet size $\sigma$), where $\mu_{(1-\epsilon)L}$ is the number of MEMs of length at least $(1-\epsilon)L$. With lazy evaluation, the bound becomes $O\left(\left(\frac{m}{\epsilon L} + \mu_{(1-\epsilon)L}\right) t(n)\right)$, where $t(n)$ is the cost of a single LCP/LCS query. The approach unifies and improves on the earlier methods of Li (Li12) and Goga et al. by concentrating computation on the long, informative MEMs, which keeps MEM finding fast and memory-conscious for pangenome-scale references in applications such as read mapping and genome analysis of highly repetitive datasets.
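The filtering idea can be illustrated with a minimal Python sketch. It rests on several assumptions: naive substring tests (`in`) stand in for the LCP/LCS queries on the compact index, so it has none of the stated time bounds; the function name `long_mems_filtered` is hypothetical; and the control flow is one plausible reading of a filtered forward-backward scan, not necessarily the paper's exact procedure. The key step is the backward check from position $i + L - 1$: if the match ending there cannot reach back to $i$, then no MEM of length at least $L$ can start before the point where it stops, so those start positions are skipped.

```python
def long_mems_filtered(P, T, L):
    """Report (start, length) for every MEM of P w.r.t. T with length >= L.

    Illustrative only: `in` replaces the index queries, and L >= 1 is assumed.
    """
    m = len(P)
    mems = []
    i = 0
    while i + L <= m:
        end = i + L  # exclusive end of the length-L window starting at i
        # Backward step: extend left from position end-1 while P[b-1:end] occurs.
        b = end
        while b > i and P[b - 1:end] in T:
            b -= 1
        if b > i:
            # P[b-1:end] does not occur in T, so no MEM of length >= L
            # starts in [i, b-1]; skip straight to b.
            i = b
            continue
        # P[i:end] occurs: extend right to find the end of the MEM.
        e = end
        while e < m and P[i:e + 1] in T:
            e += 1
        mems.append((i, e - i))
        if e == m:
            break  # any later match is contained in this MEM
        # Backward step from position e (as in a forward-backward scan):
        # the next MEM starts at the leftmost b with P[b:e+1] occurring in T.
        b = e + 1
        while b > i + 1 and P[b - 1:e + 1] in T:
            b -= 1
        i = b
    return mems
```

For example, with `P = "CGTTACG"` and `T = "ACGTACGT"`, calling `long_mems_filtered(P, T, 4)` returns `[(3, 4)]` (the match `TACG`), and the length-3 MEM `CGT` at position 0 is skipped after a single backward check.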
Abstract
Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least $L$ between a pattern of length $m$ and a text of length $n$ in $O(m)$ time plus extra $O(\log n)$ time only for each MEM of length at least nearly $L$ using a compact index for the text, suitable for pangenomics.
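To make the problem statement concrete, here is a brute-force Python sketch of what "all MEMs of length at least $L$" means. It is only a reference definition: the function names are hypothetical, it uses quadratic-time substring tests instead of any index, and it is not the method of the paper. It computes, for each position $i$ of the pattern, the length of the longest substring starting at $i$ that occurs in the text, and keeps position $i$ as a MEM start when that match cannot also be extended to the left.

```python
def matching_statistics(P, T):
    """ms[i] = length of the longest prefix of P[i:] that occurs in T."""
    ms = []
    for i in range(len(P)):
        length = 0
        while i + length < len(P) and P[i:i + length + 1] in T:
            length += 1
        ms.append(length)
    return ms


def long_mems_brute(P, T, L):
    """Return (start, length) for every MEM of P w.r.t. T with length >= L (L >= 1)."""
    ms = matching_statistics(P, T)
    mems = []
    for i, length in enumerate(ms):
        # The match at i is right-maximal by construction; it is left-maximal
        # exactly when the match starting at i-1 does not cover it.
        left_maximal = (i == 0) or (ms[i - 1] <= length)
        if length >= L and left_maximal:
            mems.append((i, length))
    return mems
```

With `P = "CGTTACG"` and `T = "ACGTACGT"`, the matching statistics are `[3, 2, 1, 4, 3, 2, 1]`, so `long_mems_brute(P, T, 1)` returns `[(0, 3), (3, 4)]` and `long_mems_brute(P, T, 4)` returns `[(3, 4)]`.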
