
Engineering Rank/Select Data Structures for Large-Alphabet Strings

Diego Arroyuelo, Gabriel Carmona, Héctor Larrañaga, Francisco Riveros, Carlos Eugenio Rojas-Morales, Erick Sepúlveda

TL;DR

This work addresses the challenge of supporting rank and select on large-alphabet strings with practical efficiency. It advances the alphabet-partitioning approach by replacing the central mapping $t$ with per-partition bit vectors, achieving compressed space close to $nH_0(s)$ while delivering fast queries. Through extensive experiments, the authors demonstrate substantial speedups in key applications (snippet extraction, inverted-list intersection, and RLFM-index style counting), often at modest space overhead, and they also show strong potential for distributed computation. The results indicate that the proposed ASAP framework provides a practical and scalable alternative for large-alphabet text processing in information retrieval and beyond.

Abstract

Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduce several challenges that do not arise in small-alphabet strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the alphabet-partitioning approach. The main contribution is a compressed data structure that supports the fundamental operations $rank$ and $select$ efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach. In particular, the time for operation $select$ can be improved by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), the representation of run-length compressed strings, and the distributed-computation processing of $rank$ and $select$ operations. In the particular case of run-length compressed strings, our experiments on the Burrows-Wheeler transform of highly repetitive texts indicate that by using only about 0.98 to 1.09 times the space of state-of-the-art RLFM-indexes (depending on the text), the process of counting the number of occurrences of a pattern in a text can be carried out 1.23 to 2.33 times faster.
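To make the operations named in the abstract concrete, the following is a minimal reference sketch of $rank$, $select$, and the zero-order entropy $H_0(s)$ on a plain Python string. It only fixes the semantics; it is not the compressed structure the paper engineers. The function names and the indexing convention (0-based positions, $rank$ over the prefix of length `i`, 1-based occurrence counts for $select$) are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def rank(s, c, i):
    """Number of occurrences of symbol c in the prefix s[0:i]."""
    return s[:i].count(c)


def select(s, c, j):
    """0-based position of the j-th (j >= 1) occurrence of c in s."""
    seen = 0
    for pos, x in enumerate(s):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("s has fewer than j occurrences of c")


def h0(s):
    """Zero-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    return sum((f / n) * log2(n / f) for f in Counter(s).values())


s = "alabar_a_la_alabarda"
print(rank(s, "a", 6))    # 3: 'a' occurs three times in "alabar"
print(select(s, "l", 2))  # 9: the second 'l' is at position 9
print(len(s) * h0(s))     # n * H_0(s), the space target in bits
```

Both naive operations take time linear in the length of the string; the point of a rank/select data structure is to answer them quickly within space close to $nH_0(s)$ bits.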

Paper Structure (41 sections, 20 equations, 14 figures, 6 tables, 2 algorithms)


Figures (14)

  • Figure 1: Alphabet-partitioning data structure for the string $s= \mathtt{alabar\_a\_la\_alabarda}$, assuming 4 sub-alphabets $\Sigma_0$, $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$, the corresponding mapping $t$ and the sub-alphabet strings $s_0$, $s_1$, $s_2$, and $s_3$.
  • Figure 2: Our implementation of the alphabet-partitioning data structure for the string $s= \mathtt{alabar\_a\_la\_alabarda}$, assuming 4 sub-alphabets $\Sigma_0$, $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$. The original mapping $t$ is replaced by bit vectors $B_0$, $B_1$, $B_2$, and $B_3$.
  • Figure 3: Experimental results for operations $\mathsf{rank}$ (above) and $\mathsf{select}$ (below) on the Wikipedia text. The $x$ axis shows the space usage in bits per symbol and starts at $H_0(s) = 12.45$ bits. The $y$ axis shows the average operation time, in microseconds per operation.
  • Figure 4: Experimental results for operation $\mathsf{access}$. The $x$ axis shows the space usage in bits per symbol and starts at $H_0(s) = 12.45$ bits. The $y$ axis shows the average time of operation $\mathsf{access}$, in microseconds per operation.
  • Figure 5: Experimental results for extracting snippets of length $L=100$ (right) and $L=200$ (left). The $x$ axis shows the space usage in bits per symbol and starts at $H_0(s) = 12.45$ bits. The $y$ axis shows the average extraction time, in microseconds per symbol extracted.
  • ...and 9 more figures
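The layout described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the authors' implementation: it assumes that bit vector $B_i$ has a 1 at exactly those positions of $s$ whose symbol belongs to $\Sigma_i$, it assigns symbols to sub-alphabets by frequency rank (the symbol of rank $r$ goes to class $\lfloor \log_2 r \rfloor$, so the number of classes need not match the four shown in the figures), and it stores everything in plain Python lists with linear-time scans where a real structure would use compressed bit vectors with constant-time support. All function names are hypothetical.

```python
from collections import Counter


def build(s):
    """Split s into sub-alphabet strings plus one bit vector per class."""
    freq = Counter(s)
    # Most frequent symbols first; ties broken by symbol for determinism.
    order = sorted(freq, key=lambda c: (-freq[c], c))
    part, local, sizes = {}, {}, []
    for r, c in enumerate(order, start=1):
        p = r.bit_length() - 1          # assumed rule: floor(log2(r))
        if p == len(sizes):
            sizes.append(0)
        part[c] = p                     # class of symbol c
        local[c] = sizes[p]             # id of c inside its class
        sizes[p] += 1
    B = [[0] * len(s) for _ in sizes]   # B[p][j] = 1 iff s[j] is in class p
    subs = [[] for _ in sizes]          # subs[p] = s restricted to class p
    for j, c in enumerate(s):
        p = part[c]
        B[p][j] = 1
        subs[p].append(local[c])
    return part, local, B, subs


def _select(seq, x, j):
    """0-based position of the j-th (j >= 1) occurrence of x in seq."""
    seen = 0
    for pos, y in enumerate(seq):
        if y == x:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")


def ap_rank(idx, c, i):
    """Occurrences of c in s[0:i]: rank in B_p, then rank in s_p."""
    part, local, B, subs = idx
    if c not in part:
        return 0
    p = part[c]
    k = sum(B[p][:i])                   # symbols of class p among s[0:i]
    return subs[p][:k].count(local[c])


def ap_select(idx, c, j):
    """Position in s of the j-th c: select in s_p, then select in B_p."""
    part, local, B, subs = idx
    p = part[c]
    q = _select(subs[p], local[c], j)   # position inside s_p
    return _select(B[p], 1, q + 1)      # map back to a position in s


def ap_access(idx, i):
    """Recover s[i] by locating the class whose bit vector is set at i."""
    part, local, B, subs = idx
    p = next(q for q in range(len(B)) if B[q][i])
    x = subs[p][sum(B[p][:i])]
    return next(c for c in part if part[c] == p and local[c] == x)


idx = build("alabar_a_la_alabarda")
print(ap_rank(idx, "a", 6), ap_select(idx, "l", 2), ap_access(idx, 3))
```

In this sketch, $\mathsf{rank}$ and $\mathsf{select}$ touch only the one bit vector of the queried symbol's class, whereas $\mathsf{access}$ has to probe the bit vectors until it finds the class of position `i`.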