Compressing Suffix Trees by Path Decompositions

Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, Nicola Prezza

TL;DR

This work presents a repetition-aware approach to compressing suffix-tree topology via path decompositions of the suffix tree leaves. By using an order-preserving permutation π, the authors define STPDs and derive PDA samples that bound the number of distinct paths by $r$ (or $\bar{r}$ for IPA), enabling a suffix-tree-like index that occupies $O(r)$ words on top of a text oracle. The framework supports key navigation and pattern-matching queries with favorable I/O complexity, comparable to or better than RAM-based models in practice, and demonstrates superior locality over the $r$-index on repetitive genome collections. Beyond practical gains, the method offers a general mechanism: for any order-preserving π, PDA_π yields a sampling of the Prefix Array with a provable repetitiveness measure, enabling efficient primary-secondary occurrence localization through two-dimensional orthogonal queries. Empirically, the approach achieves smaller space and faster locate times than competing methods, suggesting a significant step toward I/O-efficient compressed indexes for large, repetitive data.

Abstract

The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came with a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.
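
To make the measure $r$ concrete, the following minimal sketch computes the Burrows-Wheeler transform of a short text by naive suffix sorting and counts its equal-letter runs. The function names `bwt` and `count_runs` are illustrative and do not come from the paper, and the sketch assumes the text ends with a unique terminator `$` that is smaller than every other character.

```python
def bwt(text: str) -> str:
    """Naive BWT: sort suffixes, then take the character preceding each one.

    Assumption: `text` ends with a unique terminator (here '$') that is
    lexicographically smaller than every other character.
    """
    order = sorted(range(len(text)), key=lambda j: text[j:])
    # text[j - 1] wraps around to the terminator when j == 0.
    return "".join(text[j - 1] for j in order)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for k in range(len(s)) if k == 0 or s[k] != s[k - 1])


if __name__ == "__main__":
    transformed = bwt("banana$")
    print(transformed, count_runs(transformed))
```

On `banana$` this yields the transform `annb$aa`, which has $r = 5$ runs. Sorting suffixes by direct string comparison is quadratic in the worst case and is only meant for small illustrative inputs.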

Paper Structure

This paper contains 16 sections, 2 theorems, 1 equation, 1 figure.

Key Result

Theorem

Let $\mathcal{T}$ be a text of length $n$ over an alphabet of size $\sigma$. Assume we have access to an oracle supporting longest common extension ($\mathrm{lce}$) and random access queries (extraction of one character) on $\mathcal{T}$ in $O(t)$ time. Then, there is a representation of $\mathcal{T}$ [...] If the text oracle also supports computing a collision-free (on text substrings) hash $\kappa$ [...]

Figures (1)

  • Figure 1: Overview of our technique. We sort $\mathcal{T}$'s suffixes $\mathcal{T}[i,n]$ (equivalently, suffix tree leaves) by increasing $\pi(i)$. In this example, $\pi$ corresponds to the standard lexicographic order of the text's suffixes (but $\pi$ can be more general). This induces a suffix tree path decomposition (an edge-disjoint set of node-to-leaf paths covering all edges) obtained by always following the leftmost path. At this point, we associate each path with the integer $i$ such that the path's label is $\mathcal{T}[i,n]$ (in the figure, we also color each path according to the color of the associated position $i$). In particular, the label from the root to the first path's edge is $\alpha_{i,k} = \mathcal{T}[i-k+1,i]$, where $k$ is the string depth of the first path's edge. Our indexing strategy essentially consists of implicitly sorting the strings $\alpha_{i,k}$ colexicographically in a subset of the Prefix Array called the Path Decomposition Array $\mathrm{PDA}$: the colexicographically sorted array containing (without duplicates) the first position of each path (in our example: $\mathrm{PDA} = [11,10,3,7,8]$). Observe that, in this example, there are few (five) distinct path labels: indeed, we show that $|\mathrm{PDA}|$ is bounded by universal compressibility measures and that it can be used to support basic suffix tree navigation and pattern matching operations.
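
The construction described in the caption can be sketched naively for the case where $\pi$ is the lexicographic rank of each suffix. This is an illustration under stated assumptions, not the authors' algorithm: instead of building the suffix tree, it assumes that the path ending at the leaf of suffix $j$ starts at string depth equal to the longest common prefix between $\mathcal{T}[j,n]$ and any suffix with smaller $\pi$, so that the path's first position is $j$ plus that depth. The function names are hypothetical, positions are 1-based as in the caption, and the text is assumed to end with a unique terminator `$` smaller than all other characters.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def path_decomposition_array(text: str) -> list[int]:
    """Naive PDA sketch for pi = lexicographic rank of the suffixes.

    Assumptions (not taken from the source): `text` ends with a unique
    smallest terminator, and each path's starting depth equals the maximum
    lcp with any suffix of smaller pi. Returned positions are 1-based.
    """
    # 0-based suffix start positions, listed by increasing pi.
    order = sorted(range(len(text)), key=lambda j: text[j:])
    firsts = set()
    for rank, j in enumerate(order):
        # String depth at which the path of suffix j branches off.
        depth = max((lcp(text[j:], text[p:]) for p in order[:rank]), default=0)
        firsts.add(j + depth + 1)  # 1-based first position of the path
    # Colexicographic order: compare the prefixes text[1..i] read backwards.
    return sorted(firsts, key=lambda i: text[:i][::-1])


if __name__ == "__main__":
    print(path_decomposition_array("banana$"))
```

On `banana$` the seven leaves give only four distinct first positions, and the sketch returns `[7, 6, 1, 5]`. The quadratic number of comparisons keeps the code short; it is not meant to reflect the construction costs discussed in the paper.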

Theorems & Definitions (9)

  • Remark
  • Theorem
  • Corollary
  • Definition
  • Definition: $\mathrm{lcp}$, $\mathrm{lcs}$
  • Definition: $\mathrm{rlce}$, $\mathrm{llce}$, and $\mathrm{lce}$
  • Definition: Order-preserving permutation
  • Definition: Generalized Longest Previous Factor Array $\mathrm{LPF}_{\mathcal{S}, \pi}$
  • Definition: Path Decomposition Array $\mathrm{PDA}_{\mathcal{S}, \pi}$