Suffixient Arrays: a New Efficient Suffix Array Compression Technique

Davide Cenzato, Lore Depuydt, Travis Gagie, Sung-Hwan Kim, Giovanni Manzini, Francisco Olivares, Nicola Prezza

TL;DR

This paper presents the Suffixient Array, a tiny subset of the Suffix Array sufficient to locate on-line one pattern occurrence (in general, all its Maximal Exact Matches) via binary search, provided that random access to the text is available.

Abstract

The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely compressible. Several works in the literature showed that the Suffix Array can be compressed, but they all rely on complex succinct data structures which in practice tend to exhibit poor cache locality and thus significantly slow down queries. In this paper, we propose a new simple and very efficient solution to this problem by presenting the Suffixient Array: a tiny subset of the Suffix Array sufficient to locate on-line one pattern occurrence (in general, all its Maximal Exact Matches) via binary search, provided that random access to the text is available. We prove that: (i) the Suffixient Array length $\chi$ is a strong repetitiveness measure, (ii) unlike most existing repetition-aware indexes such as the $r$-index, our new index is efficient in the I/O model, and (iii) Suffixient Arrays can be computed in linear time and compressed working space. We show experimentally that, when using well-established compressed random access data structures on repetitive collections, the Suffixient Array is simultaneously (i) faster and orders of magnitude smaller than the Suffix Array $\mathrm{SA}$ and (ii) smaller and one to two orders of magnitude faster than the $r$-index. With an average pattern matching query time as low as 3.5 ns per character, our new index gets very close to the ultimate lower bound: the RAM throughput of our workstation (1.18 ns per character).
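The classic index that the abstract starts from can be sketched in a few lines. This is a minimal illustration of plain Suffix Array binary search, not of the Suffixient Array itself; the names `build_suffix_array` and `locate_one` are hypothetical, and the naive construction is chosen only for brevity.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix (0-based)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate_one(text, sa, pattern):
    """Return the 0-based start of one occurrence of pattern, or None."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    # Lower-bound binary search on the length-m prefixes of the sorted suffixes.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        return sa[lo]
    return None


text = "banana$"
sa = build_suffix_array(text)
pos = locate_one(text, sa, "ana")
```

Each comparison reads up to `len(pattern)` characters of the text, which is why random access to the text is required; the array itself stores one integer per text position, the linear space cost that the abstract identifies as the main drawback.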

Paper Structure (36 sections, 24 theorems, 6 equations, 7 figures, 1 table, 7 algorithms)


Key Result

Lemma 10

Let $T[1,n]$ be a text. Let $\mathrm{SA}$ and $\mathrm{LCP}$ denote the Suffix Array and LCP array of $T^{\text{rev}}$. The set $\{n-\mathrm{SA}[i]+1 \ :\ i = 1\ \vee\ i=n\ \vee\ \mathrm{BWT}[i]\neq \mathrm{BWT}[i-1]\ \vee\ \mathrm{BWT}[i]\neq \mathrm{BWT}[i+1],\ 1 \le i \le n\}$ of positions on $T$ corresp
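The set in Lemma 10 can be computed directly from its definition. The sketch below is a naive illustration under assumptions that this summary does not state: the text ends with a unique `$` smaller than every other character, $T^{\text{rev}}$ is the reverse of $T[1,n-1]$ followed by that `$`, and $\mathrm{BWT}[i]$ is the character cyclically preceding the $i$-th sorted suffix of $T^{\text{rev}}$. The name `run_break_positions` is hypothetical. Because the lemma statement is cut off above, the code only builds the set and makes no claim about what the lemma concludes.

```python
def run_break_positions(text):
    """Return the 1-based text positions n - SA[i] + 1 picked at BWT run breaks.

    Assumptions (not taken from the source): text ends with a unique '$',
    and T^rev is the reverse of text[:-1] followed by that '$'.
    """
    n = len(text)
    rev = text[-2::-1] + text[-1]  # reverse of T[1,n-1], then the sentinel
    # Naive suffix sorting of T^rev; values are 1-based as in the lemma.
    sa = [i + 1 for i in sorted(range(n), key=lambda i: rev[i:])]
    # Cyclic BWT: the character preceding each sorted suffix
    # (s == 1 wraps around to the last character of T^rev).
    bwt = [rev[s - 2] for s in sa]
    selected = set()
    for i in range(n):  # i is 0-based here; the lemma's rank is i + 1
        first_or_last = i == 0 or i == n - 1
        break_before = i > 0 and bwt[i] != bwt[i - 1]
        break_after = i < n - 1 and bwt[i] != bwt[i + 1]
        if first_or_last or break_before or break_after:
            selected.add(n - sa[i] + 1)
    return selected


positions = run_break_positions("banana$")
```

For `banana$` the BWT of the reversed text under these conventions is `bnn$aaa`, and the selected positions are `{1, 2, 3, 5, 6, 7}`; only the middle `a` of the final run is skipped.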

Figures (7)

  • Figure 1: A text $T$ with the positions of the suffixient set $S = \{14, 20, 33, 35\}$ highlighted. The figure also shows the suffix tree for $T$: each leaf is identified by the starting position in $T$ of the associated suffix; for example, the third leaf from the left has index 29 since it corresponds to the suffix $T[29,35] = \mathsf{001001\$}$. Node colors show how the set $S$ covers the suffix tree: for each edge $(u,v)$, the destination node $v$ has the same color as the position in $T$ covering it. For example, the parent of leaf 29 is colored in green since the edge $(u,v)$ leading to that node (highlighted in yellow) is covered by $T[33]$. Indeed, $u$'s path label is $\mathsf{0010}$, which is a suffix of $T[1,32]$, and $T[33]=\mathsf{0}$ is the first character of $(u,v)$.
  • Figure 2: The $d=11$ times we fully or partially descend 10 distinct edges in the suffix tree of $T$ (above) while finding the 3 MEMs for our example (below). The MEMs are shown boxed in $P$ and $T$, with the characters' colors in $P$ also indicating which path we are following in the tree when we read them. The characters in the box for a MEM that are a different color from the box are the path label of the node we reach by suffix links and descend from when finding the end of that MEM. We descend the line alternating blue and green twice. Notice that $11 = d \ll m = 34$.
  • Figure 3: The figure shows how to construct a smallest suffixient set $\mathcal{S}$ for a text $T[1,n]$ with $n=20$ following Lemma \ref{lem:link LCP - supermaximal}. To the left of the black arrows: all supermaximal extensions $\alpha\cdot a$ of $T$ and their selected ending positions in $T$, forming a suffixient set. To the right of the black arrows: $\mathrm{SA}$, $\mathrm{LCP}$, $\mathrm{BWT}$ and the sorted suffixes of $T^{\text{rev}}$. In column $\mathrm{SA}[i]$, we highlight in red all positions that are selected to be included in $\mathcal{S}$. In columns Suffixes and $\mathrm{BWT}[i]$, we highlight in green the (reverses of the) supermaximal extensions of $T$. Black arrows show how the selected $\mathrm{SA}$ positions are converted to positions in $T$ using the formula $n-\mathrm{SA}[i]+1$. How to identify positions of $\mathcal{S}$: for each $c$-run break $i$, we decide whether the two ranks $i'\in \{i-1,i\}$ should contribute to $\mathcal{S}$ (i.e., whether they correspond to a supermaximal extension) as described in Lemma \ref{lem:link LCP - supermaximal}. For brevity, we show this decision procedure on only two run breaks. Consider the $\tt A$-run break at position $i=10$ (highlighted in blue in column $i$). The blue box depicts the corresponding $\mathrm{LCP}$ interval, $\mathrm{LCP}[box(10)] = \mathrm{LCP}[3,13]$. We observe that in $[3,13]$ there are other $\tt A$-run breaks $i''$ such that $\mathrm{LCP}[i''] > \mathrm{LCP}[10]$: those are $i'' = 6, 7, 11, 12$. We conclude that text position $n - \mathrm{SA}[10] + 1 = 5$ should not be included in $\mathcal{S}$. Consider now the $\tt A$-run break $i=20$, highlighted in orange in column $i$. Position $i'=20-1 = 19$ is such that $\mathrm{BWT}[i'] = \tt A$. The orange box depicts the corresponding $\mathrm{LCP}$ interval, $\mathrm{LCP}[box(20)] = \mathrm{LCP}[17,20]$. In this case, there is no other $\tt A$-run break in $[17,20]$ with an $\mathrm{LCP}$ value larger than $\mathrm{LCP}[20]=2$. We therefore insert text position $n - \mathrm{SA}[i'] + 1 = n - \mathrm{SA}[19] + 1 = 12$ in $\mathcal{S}$. Notice that $i=20$ is also a $\tt G$-run break; repeating the above reasoning, one can verify that position $i'=20$ is indeed associated with the supermaximal extension $\tt AT\cdot G$ ending in text position $n-\mathrm{SA}[20]+1=9$.
  • Figure 4: The figure shows the data used by Algorithm \ref{alg:linear-time-algo} to construct a smallest suffixient set $\mathcal{S}$ for a text $T[1,n]$ with $n = 20$. In addition to the data shown in Figure \ref{fig:run-example}, here we include the $\mathrm{LF}$ array. Each position $i$ of this array is the result obtained when updating $\mathrm{LF}[\mathrm{BWT}[i]] = \mathrm{LF}[\mathrm{BWT}[i]] + 1$ in line 9 of Algorithm \ref{alg:linear-time-algo}. For brevity, we show how Algorithm \ref{alg:linear-time-algo} works only on the $\tt G$-run breaks. Consider the first $\tt G$-run break at position $i = 4$ (highlighted in blue in column $i$). At this stage we have $\mathrm{LF}[{\tt G}] = occ(T, {\tt \$}) + occ(T, {\tt A}) = 13$ and $R[{\tt G}] = (-1, 0, false)$, so we update $\mathrm{LF}[{\tt G}] = \mathrm{LF}[{\tt G}] + 1 = 14$ (line 9). Since $\mathrm{BWT}[i] ={\tt G}$ (so $i' = i$) and $R[{\tt G}].len = -1$, we do not call $\tt{eval}$ on either line 13 or line 14, so $R[{\tt G}].pos$ is not added to $\mathcal{S}$. Now, since $\mathrm{LCP}[4] = 2 > R[{\tt G}].len = -1$, we set $R[{\tt G}] = (\mathrm{LCP}[4], n - \mathrm{SA}[4] + 1, true) = (2, 18, true)$ (line 18) and $m = \infty$ (line 21). For the next ${\tt G}$-run break at position $i = 5$ (highlighted in red in column $i$), since we set $m = \min(\mathrm{LCP}[5], \infty) = 2$ and since $\mathrm{LCP}[\mathrm{LF}[{\tt G}]] - 1 = -1 < R[{\tt G}].len = 2$, we do not add $R[{\tt G}].pos$ to $\mathcal{S}$ on either line 13 or line 14. Next, since $\mathrm{LCP}[5] = 2 = R[{\tt G}].len$, we do not update $R[{\tt G}]$ on line 18, and we finish this iteration. The next ${\tt G}$-run break occurs at position $i = 20$ (highlighted in green in column $i$). We set $\mathrm{LF}[{\tt G}] = \mathrm{LF}[{\tt G}] + 1 = 15$ and $m = \mathrm{LCP}[20] = 2$. Since $i' = i$, we do not call $\tt{eval}$ on line 13, but since $\mathrm{LCP}[\mathrm{LF}[{\tt G}]] - 1 = 0 < R[{\tt G}].len = 2$, we add $R[{\tt G}].pos = 18$ to $\mathcal{S}$ and set $R[{\tt G}] = (0, 0, false)$ in the $\tt{eval}$ call on line 15. Next, we have $R[{\tt G}].len = 0 < \mathrm{LCP}[20] = 2$, so we update $R[{\tt G}] = (2, 9, true)$. Finally, since $R[{\tt G}].active = true$, we add $R[{\tt G}].pos = 9$ in the final $\tt{eval}$ call on line 24.
  • Figure 5: The figure shows the data used by Algorithm \ref{alg:fm-algo} to construct a smallest suffixient set $\mathcal{S}$ for a text $T[1,n]$ with $n = 20$. In addition to the data shown in Figure \ref{fig:run-example}, here we include the $\mathrm{PSV}$ and $\mathrm{NSV}$ arrays. For brevity, we show how Algorithm \ref{alg:fm-algo} works only on the $\tt G$-run breaks. Consider the first $\tt G$-run break at position $i = 4$ (highlighted in blue in column $i$). At this stage we have $i' = i$ and $R[{\tt G}] = (0, 0, false, 21)$. Since $R[{\tt G}].sa\_pos = 0 < 3 = \mathrm{PSV}[4]$ and $i = 4 < 21 = R[{\tt G}].nsv$, the condition on line 8 is satisfied but the condition on line 9 is not, so we do not add $R[{\tt G}].text\_pos$ to $\mathcal{S}$ on line 10, and we update $R[{\tt G}] = (i, n - \mathrm{SA}[i'] + 1, true, \mathrm{NSV}[i]) = (4, 18, true, 8)$ on line 12. For the next ${\tt G}$-run break at position $i = 5$ (highlighted in red in column $i$), we have $\mathrm{PSV}[4] = 3 < 4 = R[{\tt G}].sa\_pos$, so the condition on line 8 is not satisfied and nothing else is done in this iteration. The next ${\tt G}$-run break occurs at position $i = 20$ (highlighted in red in column $i$). We have $i' = i$. Since $R[{\tt G}].sa\_pos = 4 < 16 = \mathrm{PSV}[20]$ and $R[{\tt G}].nsv = 8 < 20 = i$, we add $R[{\tt G}].text\_pos = 18$ to $\mathcal{S}$ on line 10 and we update $R[{\tt G}] = (i, n - \mathrm{SA}[i'] + 1, true, \mathrm{NSV}[i]) = (20, 9, true, 21)$ on line 12. Finally, since $R[{\tt G}].active = true$, we add $R[{\tt G}].pos = 9$ in the final foreach loop on line 18.
  • ...and 2 more figures
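The left-hand side of Figure 3 (one ending position per supermaximal extension) can be reproduced by brute force on a small text. The sketch below assumes definitions that are only named, not reproduced, in this summary: a (possibly empty) string is right-maximal when at least two distinct characters follow its occurrences; a right-extension is a right-maximal string plus one following character; an extension is supermaximal when it is not a proper suffix of another extension; and a set is suffixient when every right-extension is a suffix of some prefix $T[1,x]$ with $x$ in the set. All function names are hypothetical, and the running time is far from the linear-time construction mentioned in the abstract.

```python
def right_extensions(text):
    """All strings alpha + c where alpha is right-maximal in text (assumed definition)."""
    n = len(text)
    followers = {}
    for i in range(n):
        for j in range(i, n):
            # text[i:j] (empty when i == j) is followed by the character text[j].
            followers.setdefault(text[i:j], set()).add(text[j])
    return {alpha + c
            for alpha, chars in followers.items() if len(chars) >= 2
            for c in chars}


def supermaximal_extensions(text):
    """Right-extensions that are not a proper suffix of another right-extension."""
    exts = right_extensions(text)
    return {x for x in exts
            if not any(y != x and y.endswith(x) for y in exts)}


def suffixient_from_supermaximal(text):
    """Pick one 1-based ending position per supermaximal extension."""
    chosen = set()
    for x in supermaximal_extensions(text):
        start = text.find(x)        # 0-based start of the first occurrence
        chosen.add(start + len(x))  # equals the 1-based ending position
    return chosen


def is_suffixient(text, positions):
    """Brute-force check: every right-extension is a suffix of some T[1,x]."""
    return all(any(text[:x].endswith(ext) for x in positions)
               for ext in right_extensions(text))


S = suffixient_from_supermaximal("banana$")
```

For `banana$` the supermaximal extensions under these assumptions are `b`, `a`, `anan` and `ana$`, giving the positions `{1, 2, 5, 7}`. Taking the first occurrence of each extension is an arbitrary choice; any occurrence would serve.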

Theorems & Definitions (64)

  • Definition 1: Right-maximal substring
  • Definition 2: Maximal exact match (MEM)
  • Definition 3: Suffix array ($\mathrm{SA}$), MN93
  • Definition 4: Prefix array ($\mathrm{PA}$)
  • Definition 5: LCP array, MN93
  • Definition 6: Suffix tree (ST), Weiner73
  • Definition 7: Burrows-Wheeler transform (BWT), BW94
  • Definition 8: Suffixient set - definition based on suffix trees
  • Definition 9: Suffixient set - definition based on right-maximal strings
  • Lemma 10
  • ...and 54 more
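Definition 2 is listed here by title only. Under the standard meaning of a maximal exact match (a substring of the pattern that occurs in the text and cannot be extended by one character to the left or to the right and still occur), the MEMs of a pattern can be enumerated by brute force as sketched below. This is an illustration with a hypothetical function name and does not use the Suffixient Array.

```python
def maximal_exact_matches(text, pattern):
    """Return MEMs of pattern in text as half-open 0-based (start, end) pairs."""
    m = len(pattern)
    mems = []
    prev_end = 0  # furthest end reached by a match starting at an earlier position
    for i in range(m):
        j = i
        # Extend to the right while pattern[i:j+1] still occurs in text.
        while j < m and pattern[i:j + 1] in text:
            j += 1
        # pattern[i:j] is the longest match starting at i. It is also
        # left-maximal exactly when no earlier start reached position j.
        if j > i and j > prev_end:
            mems.append((i, j))
        prev_end = max(prev_end, j)
    return mems


mems = maximal_exact_matches("banana$", "nanab")
```

Here the pattern `nanab` splits into the two MEMs `nana` and `b`, reported as `(0, 4)` and `(4, 5)`.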