Faster run-length compressed suffix arrays
Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, Marinella Sciortino
TL;DR
This work improves the query-time efficiency of run-length compressed suffix arrays (RLCSAs) by leveraging a combination of run-splitting (NT21), interpolative coding, and a rank-free query path. The main result is a faster RLCSA that computes the SA interval for a pattern extension $aP$ in $O(\log r_a)$ time without increasing asymptotic space, by replacing rank queries on sparse bitvectors with constant-time selects on interleaved run-structure vectors. The paper also situates the faster RLCSA within two-level indexing frameworks and discusses practical implications for MEM finding and parsing-based indexes, suggesting broader applicability to highly repetitive, large-alphabet texts and pangenome-like datasets. Overall, the approach preserves space while delivering a notable speedup in pattern search, with potential impact on two-level indexes and MEM heuristics in bioinformatics and related domains where run-length BWT representations arise.
Abstract
We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $σ$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O \left( \rule{0ex}{2ex} r \log (n / r) + r \log σ+ σ\right)$ bits such that later, given character $a$ and the suffix array interval for $P$, we can find the suffix-array (SA) interval for $a P$ in $O (\log r_a + \log \log n)$ time, where $r_a$ is the number of runs of copies of $a$ in the BWT. We then show how to modify the RLCSA such that we find the SA interval for $a P$ in only $O (\log r_a)$ time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.
