Compressing Suffix Trees by Path Decompositions
Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, Nicola Prezza
TL;DR
This work presents a repetition-aware approach to compressing suffix-tree topology via path decompositions of the suffix tree leaves. By using an order-preserving permutation π, the authors define STPDs and derive PDA samples that bound the number of distinct paths by $r$ (or $\bar{r}$ for IPA), enabling a suffix-tree-like index that occupies $O(r)$ words on top of a text oracle. The framework supports key navigation and pattern-matching queries with favorable I/O complexity, comparable to or better than RAM-based models in practice, and demonstrates superior locality over the $r$-index on repetitive genome collections. Beyond practical gains, the method offers a general mechanism: for any order-preserving π, PDA_π yields a sampling of the Prefix Array with a provable repetitiveness measure, enabling efficient primary-secondary occurrence localization through two-dimensional orthogonal queries. Empirically, the approach achieves smaller space and faster locate times than competing methods, suggesting a significant step toward I/O-efficient compressed indexes for large, repetitive data.
Abstract
The suffix tree is arguably the most fundamental data structure on strings: introduced by Weiner (SWAT 1973) and McCreight (JACM 1976), it allows solving a myriad of computational problems on strings in linear time. Motivated by its large space usage, subsequent research focused first on reducing its size by a constant factor via Suffix Arrays, and later on reaching space proportional to the size of the compressed string. Modern compressed indexes, such as the $r$-index (Gagie et al., SODA 2018), fit in space proportional to $r$, the number of runs in the Burrows-Wheeler transform (a strong and universal repetitiveness measure). These advances, however, came with a price: while modern compressed indexes boast optimal bounds in the RAM model, they are often orders of magnitude slower than uncompressed counterparts in practice due to catastrophic cache locality. This reality gap highlights that Big-O complexity in the RAM model has become a misleading predictor of real-world performance, leaving a critical question unanswered: can we design compressed indexes that are efficient in the I/O model of computation? We answer this in the affirmative by introducing a new Suffix Array sampling technique based on particular path decompositions of the suffix tree. We prove that sorting the suffix tree leaves by specific priority functions induces a decomposition where the number of distinct paths (each corresponding to a string suffix) is bounded by $r$. This allows us to solve indexed pattern matching efficiently in the I/O model using a Suffix Array sample of size at most $r$, strictly improving upon the (tight) $2r$ bound of Suffixient Arrays, another recent compressed Suffix Array sampling technique.
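To make the measure $r$ concrete, the following minimal Python sketch computes the Burrows-Wheeler transform of a short string and counts its runs. It is an illustration only, not the paper's construction: the function names (`suffix_array`, `bwt`, `count_runs`) are hypothetical, the suffix array is built by naive sorting, and a `$` sentinel smaller than every text character is assumed.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array by sorting suffixes; only suitable for tiny inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str, sentinel: str = "$") -> str:
    """BWT of text + sentinel: the character preceding each sorted suffix.

    Assumes the sentinel does not occur in text and sorts before all of its
    characters (true for '$' versus ASCII letters).
    """
    assert sentinel not in text
    t = text + sentinel
    # For i == 0, t[i - 1] wraps around to the sentinel, as the BWT requires.
    return "".join(t[i - 1] for i in suffix_array(t))


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        print(text, transformed, "r =", count_runs(transformed))
```

On the highly repetitive input `"ab" * 8` the run count stays at 3 even though the text has 16 characters, whereas `"banana"` yields 5 runs for 6 characters. This is the sense in which $r$ tracks repetitiveness rather than length.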
