Optimizing Bloom Filters for Modern GPU Architectures
Daniel Jünger, Kevin Kristensen, Yunsong Wang, Xiangyao Yu, Bertil Schmidt
TL;DR
The paper accelerates Bloom filters on modern GPUs by exploring a parametric vectorization space (Φ,Θ), adaptive thread cooperation, and branchless hashing, with the goal of maximizing throughput without giving up accuracy. It systematically compares filter variants (CBF, BBF, RBBF, SBF, CSBF) in cache-resident and DRAM-resident regimes and reports substantial gains over state-of-the-art GPU and CPU baselines: at equal error rate, up to 11.35× for bulk lookup and 15.4× for construction, reaching above 92% of the empirical memory-bound limit (the "practical speed-of-light") on a B200 GPU. The authors present a modular CUDA/C++ implementation and show that aligning block sizes with GPU sector granularity saturates memory bandwidth in DRAM-bound settings and reduces compute bottlenecks in cache-bound settings. Overall, the work gives practical guidance for deploying high-throughput approximate membership query (AMQ) structures on accelerators, backed by measurements across multiple GPU architectures and configurations. The authors state that the implementation will be released as open source.
Abstract
Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs, with massive thread-level parallelism and high-bandwidth memory, are a natural fit for accelerating these Bloom filter variants potentially to billions of operations per second. Although CPU-optimized implementations have been well studied, GPU designs remain underexplored. We close this gap by exploring the design space on GPUs along three dimensions: vectorization, thread cooperation, and compute latency. Our evaluation shows that the combination of these optimization points strongly affects throughput, with the largest gains achieved when the filter fits within the GPU's cache domain. We examine how the hardware responds to different parameter configurations and relate these observations to measured performance trends. Crucially, our optimized design overcomes the conventional trade-off between speed and precision, delivering the throughput typically restricted to high-error variants while maintaining the superior accuracy of high-precision configurations. At iso error rate, the proposed method outperforms the state-of-the-art by $11.35\times$ ($15.4\times$) for bulk filter lookup (construction), respectively, achieving above $92\%$ of the practical speed-of-light across a wide range of configurations on a B200 GPU. We propose a modular CUDA/C++ implementation, which will be openly available soon.
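To make the blocked-filter idea concrete, the following is a minimal host-side C++ sketch, not the authors' CUDA implementation. It assumes a 256-bit (32-byte) block, chosen to match the 32-byte memory sector commonly cited for NVIDIA GPUs, and sets one bit in each of the block's eight 32-bit words. The class name `BlockedBloomFilter`, the mixing function, and the salt constants are hypothetical choices made for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative blocked Bloom filter (hypothetical; not the paper's code).
// Every key touches exactly one 32-byte block, so a lookup needs a single
// sector-sized memory access (assuming 32-byte sectors).
class BlockedBloomFilter {
    static constexpr std::size_t kWordsPerBlock = 8;  // 8 x 32 bits = 256 bits
    struct Block { std::array<std::uint32_t, kWordsPerBlock> words{}; };
    static_assert(sizeof(Block) == 32, "block should span one 32-byte sector");

public:
    explicit BlockedBloomFilter(std::size_t num_blocks) : blocks_(num_blocks) {
        assert(num_blocks > 0);
    }

    void insert(std::uint64_t key) {
        const std::uint64_t h = mix(key);
        Block& b = blocks_[block_index(h)];
        for (std::size_t i = 0; i < kWordsPerBlock; ++i) b.words[i] |= bit_mask(h, i);
    }

    // Accumulates the per-word checks with bitwise AND instead of returning
    // early, so there is no data-dependent early exit in the loop.
    bool contains(std::uint64_t key) const {
        const std::uint64_t h = mix(key);
        const Block& b = blocks_[block_index(h)];
        std::uint32_t all_set = 1;
        for (std::size_t i = 0; i < kWordsPerBlock; ++i) {
            const std::uint32_t m = bit_mask(h, i);
            all_set &= static_cast<std::uint32_t>((b.words[i] & m) == m);
        }
        return all_set != 0;
    }

private:
    // 64-bit mixer in the style of the SplitMix64 output function.
    static std::uint64_t mix(std::uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // The upper 32 hash bits select the block.
    std::size_t block_index(std::uint64_t h) const {
        return static_cast<std::size_t>((h >> 32) % blocks_.size());
    }

    // The lower 32 hash bits, multiplied by a per-word odd constant, select
    // one of the 32 bit positions in word i (multiplicative hashing).
    static std::uint32_t bit_mask(std::uint64_t h, std::size_t i) {
        static const std::uint32_t salts[kWordsPerBlock] = {
            0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
            0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
        const std::uint32_t pos = (static_cast<std::uint32_t>(h) * salts[i]) >> 27;
        return std::uint32_t{1} << pos;
    }

    std::vector<Block> blocks_;
};
```

As a usage sketch, constructing `BlockedBloomFilter f(1024)`, calling `f.insert(key)` for each key, and then calling `f.contains(key)` returns `true` for every inserted key. A key that was never inserted may also return `true`, at a rate that depends on the number of blocks and keys. On a GPU, the analogous design choice would be how many threads cooperate on the words of one block, which is the kind of parameter the paper explores.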
