Optimizing Bloom Filters for Modern GPU Architectures

Daniel Jünger, Kevin Kristensen, Yunsong Wang, Xiangyao Yu, Bertil Schmidt

TL;DR

The paper accelerates Bloom filters on modern GPUs by exploring a parametric vectorization space (Φ,Θ), adaptive thread cooperation, and branchless hashing, with the goal of maximizing throughput while preserving high accuracy. It systematically compares variants (CBF, BBF, RBBF, SBF, CSBF) under cache-resident and DRAM-resident regimes and reports substantial gains over state-of-the-art GPU and CPU baselines: at equal error rate, 11.35× for bulk lookup and 15.4× for bulk construction over the state of the art, reaching above 92% of the practical speed-of-light (the empirical memory-bound limit) across a wide range of configurations on a B200 GPU. The authors present a modular CUDA/C++ implementation and show that aligning block sizes with GPU sector granularity saturates memory bandwidth in DRAM-bound settings and reduces compute bottlenecks in cache-bound settings. Overall, the work provides practical guidance for deploying high-throughput AMQ structures on accelerators, supported by measurements across multiple GPU architectures and configurations; the authors state that the implementation will be released as open source.

Abstract

Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs, with massive thread-level parallelism and high-bandwidth memory, are a natural fit for accelerating these Bloom filter variants potentially to billions of operations per second. Although CPU-optimized implementations have been well studied, GPU designs remain underexplored. We close this gap by exploring the design space on GPUs along three dimensions: vectorization, thread cooperation, and compute latency. Our evaluation shows that the combination of these optimization points strongly affects throughput, with the largest gains achieved when the filter fits within the GPU's cache domain. We examine how the hardware responds to different parameter configurations and relate these observations to measured performance trends. Crucially, our optimized design overcomes the conventional trade-off between speed and precision, delivering the throughput typically restricted to high-error variants while maintaining the superior accuracy of high-precision configurations. At iso error rate, the proposed method outperforms the state-of-the-art by $11.35\times$ ($15.4\times$) for bulk filter lookup (construction), respectively, achieving above $92\%$ of the practical speed-of-light across a wide range of configurations on a B200 GPU. We propose a modular CUDA/C++ implementation, which will be openly available soon.
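For readers unfamiliar with the blocked variants named in the TL;DR, the following is a minimal host-side C++ sketch of a blocked Bloom filter, in which every key touches exactly one block of $B$ bits. It is not the authors' CUDA implementation: the class name `BlockedBloomFilter`, the `mix64` mixer (a splitmix64-style finalizer), the re-hashing scheme, and the fixed sizes $B=256$ and $S=32$ are all assumptions chosen for illustration. The lookup accumulates the tested bits without an early exit, which is one simple way to keep the loop free of data-dependent branches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-bit mixer (splitmix64-style finalizer). The hash function
// used in the paper is not stated in this summary.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Blocked Bloom filter sketch: each key maps to one block of B bits and sets
// (or tests) num_hashes bit positions inside that block only.
class BlockedBloomFilter {
public:
    static constexpr std::size_t kWordBits = 32;                           // S
    static constexpr std::size_t kBlockBits = 256;                         // B
    static constexpr std::size_t kWordsPerBlock = kBlockBits / kWordBits;  // s = B/S

    BlockedBloomFilter(std::size_t num_blocks, unsigned num_hashes)
        : num_blocks_(num_blocks),
          num_hashes_(num_hashes),
          words_(num_blocks * kWordsPerBlock, 0u) {
        assert(num_blocks > 0 && num_hashes > 0);
    }

    void insert(std::uint64_t key) {
        std::uint64_t h = mix64(key);
        // The first hash selects the block; later hashes select bits in it.
        std::uint32_t* block = &words_[(h % num_blocks_) * kWordsPerBlock];
        for (unsigned i = 0; i < num_hashes_; ++i) {
            h = mix64(h);
            const std::size_t bit = h % kBlockBits;
            block[bit / kWordBits] |= 1u << (bit % kWordBits);
        }
    }

    bool contains(std::uint64_t key) const {
        std::uint64_t h = mix64(key);
        const std::uint32_t* block = &words_[(h % num_blocks_) * kWordsPerBlock];
        std::uint32_t all_set = 1u;
        // No early exit: every probe is evaluated and AND-ed into the result.
        for (unsigned i = 0; i < num_hashes_; ++i) {
            h = mix64(h);
            const std::size_t bit = h % kBlockBits;
            all_set &= (block[bit / kWordBits] >> (bit % kWordBits)) & 1u;
        }
        return all_set != 0u;
    }

private:
    std::size_t num_blocks_;
    unsigned num_hashes_;
    std::vector<std::uint32_t> words_;
};
```

Constructing `BlockedBloomFilter filter(1024, 4)`, calling `filter.insert(key)` for a set of keys, and then calling `filter.contains(key)` returns `true` for every inserted key; keys that were never inserted may still return `true`, which is the false-positive behavior the paper trades off against throughput.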

Paper Structure

This paper contains 21 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Illustration of five exemplary vectorization layouts for a block size of $B=256$ bits and a word size of $S=32$ bits, which yields $s = B/S = 8$ words per block. The arrangement of labels $w_1,\ldots,w_8$ shows the order in which words are assigned and processed, with execution progressing from bottom to top. For each layout, the annotation on the right indicates the granularity of the memory load instruction (32, 64, 128, 256 bits) applied per step. Repeated vertical arrows denote strided processing in increments of $\Theta \cdot \Phi$. For a more detailed explanation, see \ref{sec:impl:vec}.
  • Figure 2: Adaptive thread cooperation for a bulk filter lookup with $\Theta=4$. The top array represents the input key sequence. Each CUDA thread is initially assigned one key and computes its hash value, storing it in a thread-local register. Subsequently, groups of $\Theta$ consecutive threads process their keys iteratively: in each iteration, the hash value of the active key is broadcast via a register shuffle, after which the group cooperatively performs the filter lookup. Each thread maintains a register for the result of its assigned key. Once all keys have been processed, the results are written back in a coalesced manner. For details, see \ref{sec:impl:adapt}.
  • Figure 3: Throughput vs. false-positive rate frontier on NVIDIA B200. The top row ((a), (b)) shows performance for a 32 MB (L2-resident) filter, while the bottom row ((c), (d)) shows a 1 GB (DRAM-resident) filter. Data points are annotated with their corresponding block size $B$ in bits. The solid red line in the bottom row represents the practical speed-of-light (SOL) limit for random memory accesses.
  • Figure 4: Bulk construction throughput of a 32 MB SBF filter across various GPU architectures.
  • Figure 5: Bulk lookup throughput of a 32 MB SBF filter across various GPU architectures.
  • ...and 3 more figures
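
The arithmetic in the Figure 1 caption ($s = B/S = 8$ words per block, processed in strides of $\Theta \cdot \Phi$) can be made concrete with a small host-side C++ sketch. It rests on assumptions that this summary does not confirm: $\Phi$ is read as the number of contiguous words a thread fetches per load (so one load covers $\Phi \cdot S$ bits), $\Theta$ as the number of cooperating threads, and within each step thread $t$ is assumed to take words starting at `base + t * phi`. The names `LoadStep` and `plan_block_loads` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One load issued by one thread of a cooperating group.
struct LoadStep {
    std::size_t step;        // iteration index of the strided loop
    std::size_t thread;      // lane within the group, 0 .. theta-1
    std::size_t first_word;  // zero-based index of the first word loaded
    std::size_t num_words;   // phi words, i.e. phi * word_bits bits per load
};

// Enumerate the loads needed to cover one block of block_bits bits built from
// word_bits-bit words, shared by theta threads that each load phi contiguous
// words per step. Assumed schedule (not taken from the paper): in each step
// the group advances by theta * phi words, and thread t covers
// [base + t * phi, base + (t + 1) * phi).
inline std::vector<LoadStep> plan_block_loads(std::size_t block_bits,
                                              std::size_t word_bits,
                                              std::size_t theta,
                                              std::size_t phi) {
    assert(word_bits > 0 && block_bits % word_bits == 0);
    const std::size_t s = block_bits / word_bits;  // words per block
    assert(theta > 0 && phi > 0 && s % (theta * phi) == 0);

    std::vector<LoadStep> plan;
    std::size_t step = 0;
    for (std::size_t base = 0; base < s; base += theta * phi, ++step) {
        for (std::size_t t = 0; t < theta; ++t) {
            plan.push_back({step, t, base + t * phi, phi});
        }
    }
    return plan;
}
```

Under these assumptions, `plan_block_loads(256, 32, 2, 2)` yields two steps of two 64-bit loads each, while `plan_block_loads(256, 32, 1, 8)` yields a single 256-bit load by one thread; both cover all eight words of the block exactly once.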