QuadRank: Engineering a High Throughput Rank

R. Groot Koerkamp

TL;DR

BiRank and QuadRank answer rank queries on binary and DNA alphabets in memory-bandwidth-bound settings by combining inline L2 deltas, mid-block deltas, and a transposed layout with batching and prefetching. Their space overheads are 3.28% and 14.4%, respectively. They are around 1.5× and 2× faster than similar-overhead methods that do not use inlining, and prefetching gives an additional 2× speedup, at which point throughput approaches the RAM bandwidth limit. In a toy count-only FM-index, QuadFm, QuadRank with prefetching gives a smaller index and up to a 4× speedup over Genedex. The results show that careful memory-access patterns and batching can saturate modern memory systems, enabling high-throughput rank queries for large-scale bioinformatics workloads.

Abstract

Given a text, a query $\mathsf{rank}(q, c)$ counts the number of occurrences of character $c$ among the first $q$ characters of the text. Space-efficient methods to answer these rank queries form an important building block in many succinct data structures. For example, the FM-index is a widely used data structure that uses rank queries to locate all occurrences of a pattern in a text. In bioinformatics applications, the goal is usually to process a given input as fast as possible. Thus, data structures should have high throughput when used with many threads.

Contributions. For the binary alphabet, we develop BiRank with 3.28% space overhead. It merges the central ideas of two recent papers: (1) we interleave (inline) offsets in each cache line of the underlying bit vector [Laws et al., 2024], reducing cache misses, and (2) these offsets count up to the middle of each block, so that at most half of a block needs popcounting [Gottlieb and Reinert, 2025]. In QuadRank (14.4% space overhead), we extend these techniques to the $\sigma=4$ (DNA) alphabet. Both data structures require only a single cache miss per query, making them highly suitable for high-throughput and memory-bound settings. To enable efficient batch processing, we support prefetching the cache lines required to answer upcoming queries.

Results. BiRank and QuadRank are around $1.5\times$ and $2\times$ faster than similar-overhead methods that do not use inlining. Prefetching gives an additional $2\times$ speedup, at which point the dual-channel DDR4 RAM bandwidth becomes a hard limit on the total throughput. With prefetching, both methods outperform all other methods apart from SPIDER [Laws et al., 2024] by $2\times$. When QuadRank with prefetching is used in a toy count-only FM-index, QuadFm, the result is a smaller index and up to a $4\times$ speedup over Genedex, a state-of-the-art batching FM-index implementation.
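To make the mid-block idea concrete, the following is a minimal Python sketch of a binary rank structure that stores, for every block, the number of 1-bits before the middle of that block. It is an illustration under stated assumptions, not the BiRank implementation: the class name `MidBlockRank`, the 512-bit block size, and the separate `mid_counts` list are all hypothetical, and the counts are kept next to the blocks rather than packed inline into each cache line as the abstract describes.

```python
BLOCK_BITS = 512          # one 64-byte cache line; an assumption of this sketch
HALF = BLOCK_BITS // 2


def naive_rank(bits, q):
    """Reference definition: number of 1-bits among the first q bits."""
    return sum(bits[:q])


class MidBlockRank:
    """Toy binary rank structure with one count per block, taken at the block's middle."""

    def __init__(self, bits):
        self.n = len(bits)
        # Pad with zeros so that every query position 0..n falls inside a block.
        padded = list(bits) + [0] * (BLOCK_BITS - self.n % BLOCK_BITS)
        self.blocks = []      # each block packed into a Python int, bit i = position i
        self.mid_counts = []  # number of 1-bits before the middle of each block
        ones_so_far = 0
        for start in range(0, len(padded), BLOCK_BITS):
            chunk = padded[start:start + BLOCK_BITS]
            self.blocks.append(sum(b << i for i, b in enumerate(chunk)))
            self.mid_counts.append(ones_so_far + sum(chunk[:HALF]))
            ones_so_far += sum(chunk)

    def rank(self, q):
        """Count 1-bits in positions [0, q), for 0 <= q <= n."""
        block, pos = divmod(q, BLOCK_BITS)
        word = self.blocks[block]
        if pos < HALF:
            # Subtract the 1-bits in [pos, HALF) from the count at the middle.
            mask = ((1 << HALF) - 1) ^ ((1 << pos) - 1)
            return self.mid_counts[block] - bin(word & mask).count("1")
        # Add the 1-bits in [HALF, pos) to the count at the middle.
        mask = ((1 << pos) - 1) ^ ((1 << HALF) - 1)
        return self.mid_counts[block] + bin(word & mask).count("1")


# Usage: the structure agrees with the naive definition on a small input.
example_bits = [1, 0, 1, 1, 0, 0, 1] * 200
structure = MidBlockRank(example_bits)
assert structure.rank(700) == naive_rank(example_bits, 700)
```

In either branch the popcount covers at most half a block, which is the point of counting to the middle rather than to the start of the block.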

Paper Structure (12 sections, 4 equations, 3 figures)

Introduction
Background
Further implementations
BiRank
Variants
QuadRank
Variants
Results
BiRank
QuadRank
Conclusion
Code snippets

Figures (3)

Figure 1: Schematic overview of rank data structures. The top and bottom half are for $\sigma=2$ and $\sigma=4$ respectively. Each line shows a data structure (and notable (re)implementations) with its overhead and the layout of a single superblock (not to scale). Each structure stores up to 3 vectors containing (interleaved) superblock offsets, block deltas, and raw bits. On the right (black) are the blocks containing (bitpacked) data. Each superblock contains a single L1 offset (teal) that is either absolute, or sometimes relative to a 64-bit L0 value (green). They usually count the number of 1-bits/characters before the start of the superblock as indicated by the teal dot, or to the middle of the superblock for pairing variants. L2 deltas (yellow) count from the start/middle of the superblock to the start of each block (yellow dots). Only in poppy do they count individual blocks (yellow lines). For pairing, pairing fBV, BiRank, and QuadRank, L2 deltas are to the middle of each (pair of) block(s). AWFM, (pairing) fBV, and QuadRank store the text transposed, alternating words of low and high bits.
Figure 2: Log-log space-time trade-offs for rank structures on binary input of total size 4 GB. The top/middle/bottom row show results for 1/6/12 threads on a CPU with 6 cores. The left/middle/right column show results for the latency, the throughput of a for loop, and the throughput of a for loop with prefetching. Red lines indicate: (left) the roughly 80 ns RAM latency divided by the number of threads, (top mid/right) the 7.5 ns/read maximum random-access RAM throughput of 1 thread, and (rest) the 2.5 ns/cache line total random-access RAM throughput. In the right column, the transparent markers repeat the for-loop throughput. The legend is sorted by increasing overhead.
Figure 3: Space-time trade-off of rank structures on a size-4 alphabet for a 4 GiB input. Compared to Figure 2, here we benchmark both $\mathsf{rank}(q, c)$ (small markers) and $\mathsf{rank_4}$ (large markers).
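The transposed layout mentioned in the Figure 1 caption, and the difference between $\mathsf{rank}(q, c)$ and $\mathsf{rank_4}$ in Figure 3, can be illustrated with a small Python sketch. It assumes a 2-bit encoding A=0, C=1, G=2, T=3 and a single pair of 64-bit words holding the low and high bits of 64 characters; the function names `transpose`, `count_in_prefix`, and `rank4` are hypothetical, and offsets, deltas, and superblocks are left out entirely.

```python
WORD = 64
MASK = (1 << WORD) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit encoding assumed for this sketch


def transpose(text):
    """Pack up to 64 characters into a (low_bits, high_bits) pair of 64-bit words."""
    assert len(text) <= WORD
    lo = hi = 0
    for i, ch in enumerate(text):
        code = CODE[ch]
        lo |= (code & 1) << i
        hi |= (code >> 1) << i
    return lo, hi


def count_in_prefix(lo, hi, c, q):
    """Occurrences of character code c among the first q characters of the word pair."""
    # Invert a word where the target bit is 0, so that a 1 means "this bit matches c".
    lo_match = lo if c & 1 else ~lo & MASK
    hi_match = hi if c >> 1 else ~hi & MASK
    prefix = (1 << q) - 1
    return bin(lo_match & hi_match & prefix).count("1")


def rank4(lo, hi, q):
    """Counts of all four characters among the first q characters, in code order."""
    return [count_in_prefix(lo, hi, c, q) for c in range(4)]


lo, hi = transpose("ACGTTGCA")
print(rank4(lo, hi, 5))  # first five characters are "ACGTT" -> [1, 1, 1, 2]
```

Because the low and high bits sit in separate words, matching one character takes two word-wide logical operations and one popcount, instead of extracting 2-bit fields one at a time.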
