Faster run-length compressed suffix arrays

Nathaniel K. Brown; Travis Gagie; Giovanni Manzini; Gonzalo Navarro; Marinella Sciortino

TL;DR

This work speeds up queries on run-length compressed suffix arrays (RLCSAs) by combining the run-splitting result of Nishimoto and Tabei (NT21), searchable interpolative coding, and a query path that avoids rank queries. The main result is an RLCSA that, given the SA interval for a pattern $P$, computes the SA interval for the extension $aP$ in $O(\log r_a)$ time instead of $O(\log r_a + \log \log n)$, without increasing the asymptotic space bound, by replacing rank queries on sparse bitvectors with a constant number of select queries. The paper also places the faster RLCSA within two-level indexing and discusses how two-level indexing may speed up a recent heuristic for finding maximal exact matches (MEMs), suggesting applicability to parsing-based indexes over highly repetitive, large-alphabet texts and to pangenome-like datasets.

Abstract

We first review how we can store a run-length compressed suffix array (RLCSA) for a text $T$ of length $n$ over an alphabet of size $\sigma$ whose Burrows-Wheeler Transform (BWT) consists of $r$ runs in $O(r \log (n / r) + r \log \sigma + \sigma)$ bits such that later, given character $a$ and the suffix array interval for $P$, we can find the suffix-array (SA) interval for $a P$ in $O (\log r_a + \log \log n)$ time, where $r_a$ is the number of runs of copies of $a$ in the BWT. We then show how to modify the RLCSA such that we find the SA interval for $a P$ in only $O (\log r_a)$ time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.
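
The quantities $r$ and $r_a$ in these bounds can be made concrete on a toy input. The following is a minimal sketch, not the paper's construction: it builds the BWT by sorting all rotations (practical only for tiny texts) and counts runs. The function names `bwt` and `run_counts` and the example text `banana$` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import groupby


def bwt(text):
    """Naive BWT: last column of the sorted rotations (text must end with a unique sentinel)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_counts(bwt_string):
    """Return (r, per-character run counts), where a run is a maximal block of equal characters."""
    heads = [ch for ch, _ in groupby(bwt_string)]  # one head character per run
    return len(heads), Counter(heads)


transformed = bwt("banana$")      # "annb$aa"
r, runs_per_char = run_counts(transformed)
print(transformed, r, runs_per_char["a"])  # annb$aa 5 2
```

Here the BWT has $r = 5$ runs, two of which are runs of `a`, so $r_a = 2$ for $a = \mathtt{a}$.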

Paper Structure (10 sections, 4 theorems, 12 equations, 6 figures)

Introduction
Preliminaries
Compressed suffix arrays
Run-length compressed suffix arrays revisited
Faster RLCSAs
Searchable Interpolative coding
Splitting Theorem for RLCSAs
A faster RLCSA without rank queries
Two-level indexing
Boyer-Moore-Li with two-level indexing

Key Result

Theorem 4

We can store $\Psi'$ for $T$ in $O (r (H_0 (L') + 1)) \subseteq O (r \log \sigma)$ bits and support binary search in the increasing interval for a character $a$ in $O (\log r_a)$ time, where $r_a$ is the number of runs of copies of $a$ in the BWT of $T$.
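
The binary search that Theorem 4 refers to can be illustrated on an uncompressed list. The sketch below assumes the increasing interval is available as a plain Python list and uses the $\Psi$ values for G quoted in the Figure 1 caption; it does not model the compressed representation of $\Psi'$ or its space bound. The function name `extend_interval` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right


def extend_interval(psi_values, first_index, lo, hi):
    """Given the increasing Psi values for character a (stored from position
    first_index) and the SA interval [lo..hi] for P, return the SA interval
    for aP, or None if aP does not occur."""
    start = bisect_left(psi_values, lo)      # successor: first value >= lo
    end = bisect_right(psi_values, hi) - 1   # predecessor: last value <= hi
    if start > end:
        return None
    return first_index + start, first_index + end


# Psi[36..48] for G and the SA interval [22..28] for CG, from the Figure 1 caption.
psi_G = [6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63]
print(extend_interval(psi_G, 36, 22, 28))  # (41, 43), the SA interval for GCG
```

Both searches take time logarithmic in the length of the list, which is where the $\log$ factor in the query time comes from.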

Figures (6)

Figure 1: For $T = \mathtt{CCTGGGCGAT\$CTTACACGAT\$GTTACCAGCT\$CTTACGCGCT\$CTGACGAATT\$CTTACGCGAT\#}$ we show $\mathrm{SA}$, $\Psi$, $F$ and $L$ on the left and $\Psi'$, $F'$ and $L'$ on the right. If we know $\mathrm{SA} [22..28]$ is the SA interval for CG (in the green rectangle on the left) and we want the SA interval for GCG, then we can search in the increasing interval $\Psi [36..48] = 6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63$ for G (in the red rectangle on the left, with $\Psi$ values between 22 and 28 shown as orange arrows and the others shown as black arrows) for the successor $\Psi [41] = 23$ of 22 and the predecessor $\Psi [43] = 28$ of 28. We thus learn that the SA interval for GCG is $\mathrm{SA} [41..43]$ (in the blue rectangle on the left). On the other hand, if we know $\mathrm{SA} [22..28]$ starts at offset 0 in the $L$ run of character $L' [12]$ (that is, at offset 0 in the 13th run, counting from 1) and ends at offset 1 in the $L$ run of character $L' [15]$ (in the green rectangle on the right), then we can search in the increasing interval $\Psi' [25..32] = 1, 3, 7, 13, 15, 22, 25, 39$ for G (in the red rectangle, with $\Psi'$ values between 12 and 15 shown as orange arrows and the others shown as black arrows) for the successor $\Psi' [29] = 13$ of 12 and the predecessor $\Psi' [30] = 15$ of 15 (in the blue rectangle on the right). We then use select and rank queries on two $n$-bit sparse vectors to find the SA interval for GCG, the $L$ runs containing that interval's starting and ending positions, and those positions' offsets in those runs.
Figure 2: A balanced binary search tree storing the $k = 13$ keys from the increasing list $6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63$ with each key in the range $[0..n - 1 = 65]$. When we reach each key in a pre-order traversal or binary search, we know it lies between the two values shown to its left and right, so we can encode it as the binary number shown below it, using a total of $O (k \log (n / k) + k)$ bits. If we store a bitvector marking the start of each encoding as visited in an in-order traversal, as shown below the tree, then we can omit the leading 0s from the encodings and support binary search in time $O (\log k)$ without changing our asymptotic space bound.
Figure 3: A set of 50 similar toy genomes of length 50 each, with the first 49 separated by copies of $ and the last one terminated by #.
Figure 4: The 563-number sequence (20 numbers per line) over the alphabet $\{0, \ldots, 90\}$ we get from the concatenation of the toy genomes in Figure 3 by parsing, replacing each phrase by its rank in the dictionary (counting from 1) and appending a 0.
Figure 5: The RLBWT of the concatenation of the toy genomes shown in Figure 3, consisting of 449 runs (20 runs per line).
...and 1 more figure
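
The encoding idea in the Figure 2 caption can be sketched as a round trip over the same 13 keys. This is a simplified illustration under stated assumptions, not the paper's searchable structure: it tightens each key's bounds by the number of keys that must fit on either side (the exact bounds used in the figure may differ), records each offset together with its bit width instead of packing bits, and omits the bitvector that marks encoding starts. The names `interpolative_encode` and `interpolative_decode` are hypothetical.

```python
def interpolative_encode(keys, lo, hi):
    """Encode sorted, distinct keys in [lo, hi] as (offset, width) pairs in pre-order."""
    codes = []

    def visit(i, j, lo, hi):
        if i > j:
            return
        mid = (i + j) // 2
        # mid - i smaller keys must fit below keys[mid], and j - mid larger ones above it.
        low = lo + (mid - i)
        high = hi - (j - mid)
        width = (high - low).bit_length()  # bits needed for an offset in [0, high - low]
        codes.append((keys[mid] - low, width))
        visit(i, mid - 1, lo, keys[mid] - 1)
        visit(mid + 1, j, keys[mid] + 1, hi)

    visit(0, len(keys) - 1, lo, hi)
    return codes


def interpolative_decode(codes, k, lo, hi):
    """Rebuild the k keys by replaying the same pre-order traversal."""
    keys = [0] * k
    stream = iter(codes)

    def visit(i, j, lo, hi):
        if i > j:
            return
        mid = (i + j) // 2
        offset, _width = next(stream)  # a real decoder would recompute the width from the bounds
        keys[mid] = lo + (mid - i) + offset
        visit(i, mid - 1, lo, keys[mid] - 1)
        visit(mid + 1, j, keys[mid] + 1, hi)

    visit(0, k - 1, lo, hi)
    return keys


keys = [6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63]
codes = interpolative_encode(keys, 0, 65)
total_bits = sum(width for _, width in codes)
print(total_bits, interpolative_decode(codes, len(keys), 0, 65) == keys)
```

For comparison, storing each of the 13 keys in a fixed 7 bits (enough for the range $[0..65]$) would take 91 bits; the interpolative widths shrink as the bounds tighten further down the tree.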

Theorems & Definitions (7)

Definition 1
Definition 2
Definition 3
Theorem 4
Corollary 5
Theorem 6: Nishimoto and Tabei NT21; Brown, Gagie and Rossi BGR22
Theorem 7
