Brief Announcement: Parallel Construction of Bumped Ribbon Retrieval

Matthias Becht, Hans-Peter Lehmann, Peter Sanders

Abstract

A retrieval data structure stores a static function $f : S \to \{0,1\}^r$. For all $x \in S$, it returns the $r$-bit value $f(x)$, while for other inputs it may return an arbitrary result. The structure cannot answer membership queries, so it does not have to encode $S$. The information-theoretic space lower bound for arbitrary inputs is $r|S|$ bits. Retrieval data structures have widespread applications. They can be used as an approximate membership filter for $S$ by storing fingerprints of the keys in $S$, where they are faster and more space-efficient than Bloom filters. They can also be used as a basic building block of succinct data structures like perfect hash functions. Bumped Ribbon Retrieval (BuRR) [Dillinger et al., SEA'22] is a recently developed retrieval data structure that is fast to construct with a space overhead of less than 1%. The idea is to solve a nearly diagonal system of linear equations to determine a matrix that, multiplied with the hash of each key, gives the desired output values. During solving, BuRR might bump lines of the equation system to another layer of the same data structure. While the paper describes a simple parallel construction based on bumping the keys on thread boundaries, it does not give an implementation. In this brief announcement, we fill this gap. Our parallel implementation is transparent to the queries. It achieves a speedup of 14 on 32 cores for 8-bit filters. The additional space overhead is 105 bytes per thread, or 105 slots. This corresponds to 0.0007% of the total space consumption when constructing with 1 billion input keys. A large portion of the construction time is spent on parallel sorting.
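
To make the "nearly diagonal system of linear equations" concrete, the following is a minimal single-layer sketch of ribbon-style retrieval in Python. It is not the authors' implementation: it omits bumping entirely and simply retries with a new hash seed if the system is unsolvable. The function names (`build`, `query`), the BLAKE2b-based hash, the ribbon width `w=32`, and the table size used below are all assumptions chosen for illustration.

```python
import hashlib

def _hash(key, seed, m, w):
    """Map a key to (start slot, w-bit coefficient vector with bit 0 set)."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=16).digest()
    h = int.from_bytes(digest, "little")
    start = h % (m - w + 1)                      # window [start, start + w) fits in the table
    coeffs = ((h >> 64) & ((1 << w) - 1)) | 1    # forcing bit 0 keeps the system banded
    return start, coeffs

def _insert(rows, vals, i, c, v):
    """Incremental Gaussian elimination over GF(2); False means 'unsolvable'."""
    while True:
        if rows[i] == 0:                         # free pivot slot: store the equation here
            rows[i], vals[i] = c, v
            return True
        c ^= rows[i]                             # eliminate the leading coefficient
        v ^= vals[i]
        if c == 0:
            return v == 0                        # redundant equation is fine, contradiction is not
        shift = (c & -c).bit_length() - 1        # number of trailing zero bits
        c >>= shift
        i += shift

def _dot(coeffs, table, start):
    """XOR of table[start + j] over all bits j that are set in coeffs."""
    out = 0
    while coeffs:
        if coeffs & 1:
            out ^= table[start]
        coeffs >>= 1
        start += 1
    return out

def build(f, m, w=32, max_seeds=100):
    """Return (seed, table) such that query(seed, table, k) == f[k] for all k in f."""
    for seed in range(max_seeds):
        rows, vals = [0] * m, [0] * m
        if all(_insert(rows, vals, *_hash(k, seed, m, w), v) for k, v in f.items()):
            table = [0] * m
            for i in range(m - 1, -1, -1):       # back substitution, last slot first
                if rows[i]:
                    table[i] = vals[i] ^ _dot(rows[i] >> 1, table, i + 1)
            return seed, table
    raise RuntimeError("no solvable seed found; increase m")

def query(seed, table, key, w=32):
    start, coeffs = _hash(key, seed, len(table), w)
    return _dot(coeffs, table, start)

# Usage: store a 2-bit value for each of 200 keys in 300 slots.
f = {f"key{i}": i % 4 for i in range(200)}
seed, table = build(f, m=300)
```

A key outside `f` still produces some 2-bit value, which is exactly the "arbitrary result" behaviour described above. Storing a fingerprint of each key as its value and comparing it at query time would turn this into an approximate membership filter. The generous table size here stands in for the bumping mechanism that lets BuRR reach a much smaller overhead.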

Paper Structure

This paper contains 7 sections, 7 figures.

Figures (7)

  • Figure 1: Illustrations of Bumped Ribbon Retrieval (BuRR) [Dillinger et al., SEA'22] with $r=2$ bits. Simplified here to ignore bumping.
  • Figure 2: Illustration of the keys that have to be bumped around the thread boundaries. Note that keys may use the slots in the gap between two threads; they just cannot overlap the thread boundary.
  • Figure 3: Space overhead in bytes with minbpt=1000 with $1^+$-bit and 2-bit thresholds.
  • Figure 4: Space overhead in bytes versus 32-thread speedup for different values of minbpt.
  • Figure 5: Space overhead in bytes with different search strategies for the cut points. Uses 100 million keys, 32 threads, minbpt=100, and a search range of 50.
  • ...and 2 more figures
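
The boundary rule from the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's algorithm. It assumes evenly spaced thread boundaries and decides bumping per key, and the function name `split_keys` and its parameters are hypothetical. The minbpt parameter and cut-point search named in the captions for Figures 3 to 5 are not modeled here.

```python
from bisect import bisect_right

def split_keys(starts, num_slots, num_threads, w):
    """Assign each key (given by its start slot) to a thread, or bump it.

    A key occupies slots [start, start + w - 1]. It is bumped when that
    window overlaps a thread boundary, i.e. when its first and last slot
    belong to different threads.
    """
    # Assumption: boundaries are evenly spaced; thread k owns [b_k, b_{k+1}).
    boundaries = [k * num_slots // num_threads for k in range(1, num_threads)]
    per_thread = [[] for _ in range(num_threads)]
    bumped = []
    for s in starts:
        first = bisect_right(boundaries, s)          # thread owning the first slot
        last = bisect_right(boundaries, s + w - 1)   # thread owning the last slot
        if first == last:
            per_thread[first].append(s)
        else:
            bumped.append(s)
    return per_thread, bumped

# Usage: 100 slots, 4 threads (boundaries at 25, 50, 75), ribbon width 8.
per_thread, bumped = split_keys([0, 17, 18, 24, 25, 43, 70, 92], 100, 4, 8)
```

Each thread can then solve its own part of the system independently, and the bumped keys are handled by the next layer.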