Adaptive encodings for small and fast compressed suffix arrays

Diego Díaz-Domínguez, Veli Mäkinen

TL;DR

This work addresses the challenge of scalable pattern search in terabyte-scale, repetitive text collections by rethinking BWT-based compressed suffix arrays. The authors introduce the VLB-tree, an adaptive, cache-friendly encoding that partitions the BWT into variable-length blocks, placing highly compressible regions near the root and refining irregular areas deeper in the tree. They augment the VLB-tree with rank, successor, and toehold-based decoding, extend it to subsampled indices, and integrate $\phi^{-1}$-based locating, yielding fast query times with space close to the subsampled $r$-index and often better cache performance. Empirical results show substantial query-time improvements over state-of-the-art baselines (including the $r$-index and $sr$-index) while maintaining competitive space; the Move data structure remains faster but at a significantly higher space cost. Overall, the VLB-tree offers a practical, adaptive alternative for large-scale, repetitive indexing with strong cache efficiency and scalable performance.

Abstract

Compressed suffix arrays (CSAs) index large repetitive collections and are key in many text applications. The r-index and its derivatives combine the run-length Burrows-Wheeler Transform (BWT) with suffix array sampling to achieve space proportional to the number of equal-symbol runs in the BWT. While effective for near-identical strings, their size grows quickly as variation increases, since the number of BWT runs is sensitive to edits. Existing approaches typically trade space for query speed, or vice versa, limiting their practicality at large scale. We introduce variable-length blocking (VLB), an encoding technique for BWT-based CSAs that adapts the amount of indexing information to local compressibility. The BWT is recursively divided into blocks of at most w runs (a parameter) and organized into a tree. Compressible regions appear near the root and store little auxiliary data, while incompressible regions lie deeper and retain additional information to speed up access. Queries traverse a short root-to-leaf path followed by a small run scan. This strategy balances space and query speed by transferring bits saved in compressible areas to accelerate access in incompressible ones. Backward search relies on rank and successor queries over the BWT. We introduce a sampling technique that guarantees correctness only along valid backward-search states, reducing space without affecting query performance. We extend VLB to encode the subsampled r-index (sr-index). Experiments show that VLB-based techniques outperform the r-index and sr-index in query time, while retaining space close to that of the sr-index. Compared to the move data structure, VLB offers a more favorable space-time tradeoff.
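
The abstract's remark that the number of BWT runs is sensitive to edits can be checked on a toy input. The following is a minimal sketch, not the authors' code: it builds the BWT by naive suffix sorting (suitable only for short strings), run-length encodes it, and compares the run count of a perfectly periodic text with the same text after one substitution. The function names `bwt`, `run_length_encode`, and `count_runs` and the toy strings are illustrative assumptions.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Return the BWT of `text` via naive suffix sorting (fine for short demos).

    Assumes `text` ends with a sentinel that is smaller than every other symbol.
    """
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT symbol of a suffix is the character preceding it; index -1 wraps
    # around to the sentinel for the suffix that starts at position 0.
    return "".join(text[i - 1] for i in suffix_array)


def run_length_encode(s: str) -> list:
    """Collapse `s` into (symbol, run_length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def count_runs(text: str) -> int:
    """Number r of equal-symbol runs in the BWT of `text`."""
    return len(run_length_encode(bwt(text)))


if __name__ == "__main__":
    repetitive = "ab" * 8 + "$"
    edited = "ab" * 4 + "cb" + "ab" * 3 + "$"  # one substitution (a -> c)
    print(run_length_encode(bwt(repetitive)))
    print("runs before edit:", count_runs(repetitive))
    print("runs after edit: ", count_runs(edited))
```

Run-length indexes such as the r-index store a constant amount of data per run, so any growth in the run count translates directly into index size.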

Paper Structure (65 sections, 2 theorems, 9 equations, 9 figures, 1 table, 3 algorithms)

This paper contains 65 sections, 2 theorems, 9 equations, 9 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Let $\mathcal{T}$ be a VLB-tree built on top of the BWT $L$ of $S$ with parameters $f, w$, and $\ell$. Accessing $L[i]$ in $\mathcal{T}$ takes $O(\log_f (\ell/w) + w)$ time.
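
To make the two terms of the bound concrete, here is a deliberately simplified sketch of an access-only structure in the spirit of the theorem. It assumes (this is not taken from the paper) that the BWT is cut into top-level blocks of length $\ell$, that each block is split into $f$ nearly equal parts until a part holds at most $w$ runs, and that leaves store plain (symbol, length) pairs. The real VLB-tree stores additional fields and supports rank and successor queries; `Node`, `build`, `build_forest`, `access`, and `depth` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Node:
    start: int                                     # first BWT position covered (0-based)
    length: int                                    # number of BWT symbols covered
    step: int = 0                                  # symbols per child (internal nodes only)
    runs: Optional[List[Tuple[str, int]]] = None   # run-length pairs (leaves only)
    children: List["Node"] = field(default_factory=list)


def build(L: str, start: int, length: int, f: int, w: int) -> Node:
    """Split L[start:start+length] into f parts until a part has at most w runs."""
    runs = [(c, len(list(g))) for c, g in groupby(L[start:start + length])]
    if len(runs) <= w:
        return Node(start, length, runs=runs)
    step = -(-length // f)  # ceil(length / f); assumes f >= 2 and w >= 1
    end = start + length
    children = [build(L, s, min(step, end - s), f, w) for s in range(start, end, step)]
    return Node(start, length, step=step, children=children)


def build_forest(L: str, ell: int, f: int, w: int) -> List[Node]:
    """One subtree per top-level block of length ell (the last may be shorter)."""
    return [build(L, s, min(ell, len(L) - s), f, w) for s in range(0, len(L), ell)]


def access(roots: List[Node], ell: int, i: int) -> str:
    """Return L[i]: descend to the leaf covering i, then scan at most w runs."""
    node = roots[i // ell]
    while node.runs is None:
        node = node.children[(i - node.start) // node.step]
    offset = i - node.start
    for symbol, run_length in node.runs:
        if offset < run_length:
            return symbol
        offset -= run_length
    raise IndexError(i)


def depth(node: Node) -> int:
    """Number of edges on the longest path from `node` down to a leaf."""
    return 0 if node.runs is not None else 1 + max(depth(c) for c in node.children)
```

In this sketch a part of length at most $w$ necessarily has at most $w$ runs, so the descent stops after at most $\lceil \log_f(\ell/w) \rceil$ levels, and the final loop touches at most $w$ runs, mirroring the $O(\log_f(\ell/w) + w)$ shape of the theorem. Blocks that are already compressible stop splitting earlier, which is the adaptive behaviour the abstract describes.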

Figures (9)

  • Figure 1: Components of the $r$-index and $sr$-index ($s=3$) for $S=\texttt{GATTACAT\$AGATACAT\$GATACAT\$GATTAGAT\$GATTAGATA\$}$. Vertical strings are the partial suffixes of $S$ in lexicographical order. The black values in $SA$ correspond to the sample $SA_h$. Values of $SA$ underlined in blue are the marked positions in $B$. The canceled values in $SA$ were originally in $SA_h$ but they were discarded after subsampling. The dashed underlined values in $SA$ are the positions in $B$ that are cleared because of subsampling. Each gray region in $B$ is a partially valid area, with the underlined $0$ (cleared) marking the start of the invalid suffix. In $D$, canceled values are discarded because their bits in $B$ were cleared.
  • Figure 2: VLB-tree built on the BWT $L$ from Figure 1. The parameters are $\ell=16$, $f=2$, and $w=2$. In nodes $v, u, e$, we label the stored node fields (omitted elsewhere for readability). We also omit array $X$ as it is the same in all nodes ($X=11$, no superblocks), and the pointer array $P$. In the alphabet bitvector $A$, a $1$ indicates that the corresponding parent symbol appears in the node. Subscripts annotate each $1$ (and the run-length sequence $E$) with the original text symbols. Dashed boxes correspond to Example \ref{ex:simp_vlbt_enc}.
  • Figure 3: Finding the successor node containing $c$.
  • Figure 4: Sampling of $Z$ arrays. The parentheses are the ranges $SA[sp_l..ep_l]$ that $backwardsearch(P)$ can visit for patterns of length $\leq 4$. The gray block $L[p..q]=L[17..32]$ marks the fragment of a root child (the one in the middle in Figure 2). The gray parentheses to the left highlight the longest left overlap $(a,b)=lmo(p,q)$, and the gray parentheses to the right highlight the longest right overlap $(y,z)=rmo(p,q)$. The alphabet $\Sigma^{(a,p-1)}$ is $\{\texttt{\$}, \texttt{C}, \texttt{G}, \texttt{T}\}$ and the alphabet $\Sigma^{(q+1,z)}$ is $\emptyset$. The $Z$ array for the root child encoding $L[p..q]$ thus stores pointers for $\{\texttt{\$}, \texttt{C}, \texttt{G}, \texttt{T}\}$.
  • Figure 5: VLB-tree $\mathcal{T}_{\phi}$ of the sequence $Q$ generated from subsampled $B$ and $D$ of Figure 1. The parameters are $\ell=16$, $f=2$, and $w=4$. Gray elements in $rle(Q)$ are partially valid areas. The number below is the length of the prefix that remains valid (gray regions in $B$ of Figure 1). Bitvector $V$ and array $M$ are elements of the fast $sr$-index (Section \ref{sec:sr-vlbtree}). Dashed lines correspond to Example \ref{ex:run_split}.
  • ...and 4 more figures
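
The ranges $SA[sp..ep]$ mentioned in the Figure 4 caption are the states of backward search. As a rough illustration only, the sketch below runs textbook backward search on the collection $S$ from Figure 1, answering rank queries with plain string counting instead of a VLB-tree. It assumes patterns do not contain the separator `$` and returns half-open, 0-based ranges; the function names are illustrative and not taken from the paper.

```python
from collections import Counter


def bwt(text: str) -> str:
    """BWT via naive suffix sorting; index -1 wraps to the final separator."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in suffix_array)


def backward_search(L: str, pattern: str):
    """Return the half-open SA range [sp, ep) of suffixes prefixed by `pattern`.

    Returns None when the pattern does not occur. Assumes `pattern` does not
    contain the separator symbol.
    """
    # C[c] = number of symbols in L that are strictly smaller than c.
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    sp, ep = 0, len(L)
    for c in reversed(pattern):
        if c not in C:
            return None
        # rank_c(L, i) is computed naively as the count of c in L[0:i].
        sp = C[c] + L.count(c, 0, sp)
        ep = C[c] + L.count(c, 0, ep)
        if sp >= ep:
            return None
    return sp, ep


if __name__ == "__main__":
    S = "GATTACAT$AGATACAT$GATACAT$GATTAGAT$GATTAGATA$"
    L = bwt(S)
    for pattern in ["GATTA", "ACAT", "TAG", "CCC"]:
        print(pattern, backward_search(L, pattern))
```

Each pattern symbol costs two rank queries, which is why the cost of rank over the BWT encoding dominates counting time; a compressed index replaces the naive `L.count` calls with its own rank structure.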

Theorems & Definitions (3)

  • Definition 1
  • Theorem 1
  • Lemma 1