Suffixient Sets

Lore Depuydt; Travis Gagie; Ben Langmead; Giovanni Manzini; Nicola Prezza

TL;DR

The paper addresses efficient computation of maximal exact matches (MEMs) on highly repetitive texts by introducing suffixient sets: sets of text positions that cover every edge of the suffix tree, in the sense made precise in the abstract. It builds a compressed index of size $O(\bar{r} + g)$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reversed text and $g$ is the number of rules in a straight-line program (SLP) for the text. The index combines an SLP-based LCP/LCS data structure with a z-fast trie over the reversed prefixes at positions in a suffixient set, supporting MEM queries, with high probability, in $O\left( \dfrac{m \log \sigma}{\log n} + d \log n \right)$ time, where $m$ is the pattern length, $\sigma$ is the alphabet size, $n$ is the text length, and $d$ counts the full or partial edge descents in the suffix tree. The work shows that a suffixient set of size at most $2\bar{r}$ always exists and relates suffixient sets to string attractors via $\gamma \le \chi \le 2\bar{r}$, where $\gamma$ is the size of a smallest string attractor and $\chi$ is the size of a smallest suffixient set. This suggests strong space efficiency on repetitive inputs and possible extensions to $O(\chi \log(n/\chi))$-space indices. Overall, the approach offers a practical, near-optimal framework for MEM-based text indexing on pangenomes and other massive repetitive datasets, with a probabilistic element due to hashing.
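To make the target of the index concrete, here is a minimal brute-force sketch of what "the MEMs of $P$ with respect to $T$" means. It is only an illustration of the definition, not the paper's algorithm: the function name `naive_mems` is hypothetical, positions are reported 1-based and inclusive in the pattern, and the running time is far worse than the bound quoted above.

```python
def naive_mems(pattern, text):
    """Return the MEMs of `pattern` w.r.t. `text` as 1-based (start, end) pairs.

    A MEM is a substring of the pattern that occurs in the text and cannot be
    extended to the left or to the right and still occur in the text.
    """
    m = len(pattern)
    # longest[i] = length of the longest prefix of pattern[i:] occurring in text.
    longest = []
    for i in range(m):
        length = 0
        while i + length < m and pattern[i:i + length + 1] in text:
            length += 1
        longest.append(length)
    mems = []
    for i in range(m):
        # The match starting at i is left-maximal unless the match starting
        # at i - 1 already covers it with one extra character on the left.
        if longest[i] > 0 and (i == 0 or longest[i - 1] <= longest[i]):
            mems.append((i + 1, i + longest[i]))
    return mems
```

For example, `naive_mems("cadabrac", "abracadabra")` reports the two overlapping matches `cadabra` and `abrac`.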

Abstract

We define a suffixient set for a text $T[1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T[1..s - 1]$ and $T[s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O(\bar{r} + g)$-space index with which, given a pattern $P[1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O(m \log \sigma / \log n + d \log n)$ time, where $\sigma$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.
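The definition can be checked directly on small inputs. The following is a rough brute-force sketch under stated assumptions: the text is assumed to end with a unique terminator character, so that the suffix-tree nodes with outgoing edges are the root and every substring followed by at least two distinct characters; the names `branching_extensions` and `is_suffixient` are hypothetical; and no suffix tree is actually built.

```python
def branching_extensions(text):
    """Map each suffix-tree node label u to the first characters of its edges.

    Assumes `text` ends with a unique terminator, so the nodes with outgoing
    edges are the root (empty string) and every substring that is followed by
    at least two distinct characters.
    """
    followers = {}
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # text[i:j] occurs at offset i and is followed by text[j].
            followers.setdefault(text[i:j], set()).add(text[j])
    return {u: cs for u, cs in followers.items() if u == "" or len(cs) >= 2}


def is_suffixient(text, positions):
    """Check the definition by brute force; `positions` are 1-based."""
    for u, chars in branching_extensions(text).items():
        for c in chars:
            # Need some s with T[s] == c and u a suffix of T[1..s-1].
            covered = any(
                text[s - 1] == c and text[:s - 1].endswith(u)
                for s in positions
            )
            if not covered:
                return False
    return True
```

On the toy text `abab$`, the set `{2, 3, 5}` passes this check, while `{3, 5}` fails because no chosen position covers the root's edge starting with `b`.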

Paper Structure (8 sections, 4 theorems, 7 equations, 4 figures, 1 algorithm)

Introduction
Definitions and Size Bounds
Compressed Index
Acknowledgments.
Disclosure of Interests.
SLP-based LCP/LCS Data Structure
Proof of Lemma \ref{lem:attractor}
Omitted Figures

Key Result

Lemma

The set of positions in $T$ of the characters at the (at most $2 \bar{r}$) run boundaries in the BWT of the reverse of $T$ is suffixient for $T$.
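The construction in this lemma can be sketched naively as follows. This is an illustrative quadratic-time version under assumptions, not the paper's procedure: the BWT of the reverse of $T$ is emulated by sorting the prefixes $T[1..s-1]$ by their reversals and reading off $T[s]$, the text is assumed to end with a unique terminator, terminator conventions may differ from the paper's, and the name `run_boundary_positions` is hypothetical.

```python
def run_boundary_positions(text):
    """Return the sorted 1-based positions of characters at BWT run boundaries.

    The BWT of the reversed text is emulated by sorting positions s = 1..n by
    the reversal of the prefix T[1..s-1]; the BWT character for s is T[s].
    """
    n = len(text)
    order = sorted(range(1, n + 1), key=lambda s: text[:s - 1][::-1])
    bwt = [text[s - 1] for s in order]
    picked = set()
    for k, s in enumerate(order):
        first_of_run = k == 0 or bwt[k - 1] != bwt[k]
        last_of_run = k == n - 1 or bwt[k + 1] != bwt[k]
        # Keep the first and last character of every run: at most 2 per run.
        if first_of_run or last_of_run:
            picked.add(s)
    return sorted(picked)
```

For `abababab$` the emulated BWT is `abbbbaaa$`, which has 4 runs, and the sketch selects the 6 positions 1, 2, 3, 7, 8, 9, within the $2\bar{r} = 8$ bound. The intuition is that the prefixes ending with a node's path label form a contiguous range in this order, and a branching node forces two different characters, hence a run boundary, inside that range.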

Figures (4)

Figure 1: The 11 times we fully or partially descend 10 distinct edges in the suffix tree of $T$ (above) while finding the 3 MEMs for our example (below). The MEMs are shown boxed in $P$ and $T$, with the characters' colours in $P$ also indicating which path we are following in the tree when we read them. The characters in the box for a MEM that are a different colour from the box are the path label of the node we reach by suffix links and descend from when finding the end of that MEM. We descend the edge drawn as a line alternating blue and green twice.
Figure 2: The set $\{14, 20, 33, 35\}$ is suffixient for $T$ in our example.
Figure 3: The prefixes of $T$ ending at positions in the suffixient set $\{14, 20, 33, 35\}$ in our example (left, in blue), the suffixes of $T$ that immediately follow them (right, in blue), the longest common prefixes of those prefixes of $T$ with the prefixes of $P$ we consider with Algorithm \ref{alg:MEMs} (left, in red), and the longest common suffixes of those suffixes of $T$ with the remaining suffixes of $P$ (right, in red).
Figure 4: A trace of how Algorithm \ref{alg:MEMs} runs on our example.

Theorems & Definitions (9)

Definition
Definition
Lemma
Proof
Lemma
Lemma
Proof
Theorem
Proof
