
Suffixient Sets

Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

TL;DR

The paper addresses efficient maximal exact match (MEM) computation for highly repetitive texts by introducing suffixient sets: sets of text positions that cover every edge of the suffix tree. It builds a compressed index of size $O(\bar{r} + g)$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reversed text and $g$ is the number of rules in a straight-line program (SLP) for the text, by combining an SLP-based LCP/LCS data structure with a $z$-fast trie over the reversed prefixes ending at positions in a suffixient set. With high probability, MEM queries run in time $O\left( \dfrac{m \log \sigma}{\log n} + d \log n \right)$, where $m$ is the pattern length, $\sigma$ is the alphabet size, $n$ is the text length, and $d$ counts edge descents in the suffix tree. The work shows that a suffixient set $S$ with $|S| \le 2\bar{r}$ always exists and relates suffixient sets to string attractors via $\gamma \le \chi \le 2\bar{r}$, where $\gamma$ is the size of a smallest string attractor and $\chi$ is the size of a smallest suffixient set. This suggests strong space-efficiency on repetitive inputs and potential extensions to $O(\chi \log(n/\chi))$-space indices. Overall, the approach offers a framework for MEM-based text indexing on pangenomes and other massive repetitive datasets, with a probabilistic element due to hashing.

Abstract

We define a suffixient set for a text $T[1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T[1..s - 1]$ and $T[s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O(\bar{r} + g)$-space index with which, given a pattern $P[1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O(m \log \sigma / \log n + d \log n)$ time, where $\sigma$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.
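
To make the definition concrete, the following is a minimal brute-force sketch in Python that checks whether a given set of positions is suffixient. It is not the paper's construction. It assumes that the nodes of the suffix tree correspond to the empty string (the root) together with every substring of $T$ that is followed by at least two distinct characters, and it does not append a terminator to $T$. The function names `right_extensions` and `is_suffixient` are hypothetical, and the code is only meant for small examples.

```python
def right_extensions(text):
    """Map every substring alpha of text (including "") to the set of
    characters that follow some occurrence of alpha in text."""
    ext = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # alpha = text[i:j] occurs at i and is followed by text[j].
            ext.setdefault(text[i:j], set()).add(text[j])
    return ext


def is_suffixient(text, positions):
    """Brute-force check of the definition; positions are 1-indexed.

    Assumption: a node u is the root ("") or a substring with at least
    two distinct following characters; no terminator is appended.
    """
    for alpha, chars in right_extensions(text).items():
        if alpha != "" and len(chars) < 2:
            continue  # alpha is not a branching node under our assumption
        for c in chars:
            target = alpha + c
            # Need s in S with alpha a suffix of T[1..s-1] and T[s] = c,
            # i.e. T[1..s] ends with alpha followed by c.
            covered = any(
                s >= len(target) and text[s - len(target):s] == target
                for s in positions
            )
            if not covered:
                return False
    return True
```

For instance, under these assumptions `is_suffixient("abaab", {2, 4})` holds, because the only branching nodes are the root and `a`, and positions 2 and 4 end occurrences of `ab` and `aa` respectively. The set `{1, 2}` fails because no position in it ends an occurrence of `aa`.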

Paper Structure (8 sections, 4 theorems, 7 equations, 4 figures, 1 algorithm)


Key Result

Lemma

The set of positions in $T$ of the characters at the at most $2 \bar{r}$ run boundaries in the BWT of the reverse of $T$ is suffixient for $T$.
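
The lemma suggests a direct way to obtain a suffixient set. The Python sketch below, intended only for small inputs, sorts the suffixes of the reversed text naively, forms the BWT, and collects the positions in $T$ of the first and last character of every run. Several details are assumptions of this illustration rather than taken from the paper: the sentinel `"\0"` (assumed absent from the text and smaller than every character), the skipping of the sentinel's own BWT entry, the counting of the sentinel's run, the naive suffix sorting, and the name `bwt_run_boundary_positions`.

```python
def bwt_run_boundary_positions(text):
    """Return (positions, runs): the 1-indexed positions in text of the
    characters at run boundaries in the BWT of the reversed text, and the
    number of runs (counting the sentinel's run, an assumption here)."""
    n = len(text)
    rev = text[::-1] + "\0"  # assumed sentinel: absent from text, smallest
    # Naive suffix array of the reversed text (fine for tiny examples).
    sa = sorted(range(n + 1), key=lambda j: rev[j:])
    # BWT character = character preceding each suffix (circularly, so the
    # suffix starting at 0 is preceded by the sentinel, rev[-1]).
    bwt = [rev[j - 1] for j in sa]

    positions = set()
    runs = 0
    for i, j in enumerate(sa):
        starts_run = i == 0 or bwt[i - 1] != bwt[i]
        ends_run = i == n or bwt[i + 1] != bwt[i]
        if starts_run:
            runs += 1
        if (starts_run or ends_run) and j > 0:
            # The suffix rev[j:] is the reverse of the prefix T[1..n-j],
            # so its BWT character is T[n-j+1].
            positions.add(n - j + 1)
    return positions, runs
```

On `"banana"` the BWT of the reversed text (with the sentinel) is `b n n \0 a a a`, which has 4 runs, and the sketch returns the positions `{1, 2, 3, 5, 6}`, within the $2 \bar{r}$ bound. Production implementations would replace the naive sort with a linear-time or compressed-space BWT construction.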

Figures (4)

  • Figure 1: The 11 times we fully or partially descend 10 distinct edges in the suffix tree of $T$ (above) while finding the 3 MEMs for our example (below). The MEMs are shown boxed in $P$ and $T$, with the characters' colours in $P$ also indicating which path we are following in the tree when we read them. The characters in the box for a MEM that are a different colour from the box are the path label of the node we reach by suffix links and descend from when finding the end of that MEM. We descend the line alternating blue and green twice.
  • Figure 2: The set $\{14, 20, 33, 35\}$ is suffixient for $T$ in our example.
  • Figure 3: The prefixes of $T$ ending at positions in the suffixient set $\{14, 20, 33, 35\}$ in our example (left, in blue), the suffixes of $T$ that immediately follow them (right, in blue), the longest common prefixes of those prefixes of $T$ with the prefixes of $P$ we consider with Algorithm 1 (left, in red), and the longest common suffixes of those suffixes of $T$ with the remaining suffixes of $P$ (right, in red).
  • Figure 4: A trace of how Algorithm 1 runs on our example.

Theorems & Definitions (9)

  • Definition
  • Definition
  • Lemma
  • Proof
  • Lemma
  • Lemma
  • Proof
  • Theorem
  • Proof