Suffixient Sets
Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza
TL;DR
The paper addresses efficient computation of maximal exact matches (MEMs) on highly repetitive texts by introducing suffixient sets: sets of text positions that cover every edge of the suffix tree, and so capture the parts of the tree needed to find MEMs. Given a straight-line program (SLP) for the text with $g$ rules, it builds a compressed index of size $O(\bar{r} + g)$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reversed text, by combining an SLP-based LCP/LCS data structure with a z-fast trie over the reversed prefixes at the positions in a suffixient set. For a pattern of length $m$, the index finds the MEMs, correctly with high probability, in $O\left( \dfrac{m \log \sigma}{\log n} + d \log n \right)$ time, where $n$ is the text length, $\sigma$ is the alphabet size, and $d$ counts the edges fully or partially descended in the suffix tree. The work shows that a suffixient set of size at most $2\bar{r}$ always exists, and it relates the size $\chi$ of a smallest suffixient set to the size $\gamma$ of a smallest string attractor via $\gamma \le \chi \le 2\bar{r}$, suggesting strong space efficiency on repetitive inputs and potential extensions to $O(\chi \log(n/\chi))$-space indices. Overall, the approach offers a space-efficient framework for MEM-based text indexing on pangenomes and other massive repetitive datasets; because it relies on hashing, its guarantees are probabilistic.
Abstract
We define a suffixient set for a text $T[1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T[1..s - 1]$ and $T[s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2\bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O(\bar{r} + g)$-space index with which, given a pattern $P[1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O(m \log(\sigma) / \log n + d \log n)$ time, where $\sigma$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.
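To make the definition concrete, the following is a minimal brute-force sketch that checks whether a set of positions is suffixient for a short text. It is not the paper's construction and is far too slow for real inputs. It rests on two assumptions that the abstract does not state: the text ends with a unique sentinel character, and the internal nodes of the suffix tree are taken to be the root (the empty string) plus every substring that is followed by at least two distinct characters in the text. The function names `right_extensions`, `suffix_tree_edges`, and `is_suffixient` are hypothetical and chosen only for illustration.

```python
def right_extensions(text):
    """Map each substring alpha of text to the set of characters that follow it."""
    ext = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # alpha = text[i:j] (empty when j == i) is followed by text[j]
            ext.setdefault(text[i:j], set()).add(text[j])
    return ext


def suffix_tree_edges(text):
    """Return pairs (alpha, c): alpha is the path label of a node with outgoing
    edges, c is the first character of one such edge.

    Assumption: text ends with a unique sentinel, so the nodes with children
    are the root plus the substrings with >= 2 distinct right extensions.
    """
    edges = set()
    for alpha, chars in right_extensions(text).items():
        if alpha == "" or len(chars) >= 2:
            for c in chars:
                edges.add((alpha, c))
    return edges


def is_suffixient(text, positions):
    """Check the definition directly; positions are 1-indexed, as in T[1..n]."""
    for alpha, c in suffix_tree_edges(text):
        covered = any(
            text[s - 1] == c and text[: s - 1].endswith(alpha)
            for s in positions
        )
        if not covered:
            return False
    return True
```

For example, with the text `abab$` the set `{2, 3, 5}` passes the check, while `{3, 5}` does not, because no chosen position holds the character `b` that labels an edge leaving the root. Enumerating all substrings takes quadratic space and worse time, which is acceptable only for toy inputs like this one.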
