ε-Cost Sharding: Scaling Hypergraph-Based Static Functions and Filters to Trillions of Keys

Sebastiano Vigna

TL;DR

This work scales static functions and static filters to trillions of keys by introducing ε-cost sharding, a principled method that bounds shard size and eliminates per-shard metadata at the price of a relative increase ε in space usage. The technique is applied to the classical MWHC construction, but the best results come from combining fuse graphs for large shards with lazy Gaussian elimination for small shards. The resulting VFunc and VFilter data structures use about 10.5% more space than the information-theoretic lower bound, with query times within a few nanoseconds of the non-sharded version, which is the fastest currently available within the same space bounds. The method enables parallel and offline construction on commodity hardware, demonstrated by offline builds reaching 60 ns per key for a 1-bit VFilter on up to $10^{12}$ keys. The results present a practical path to very fast, large-scale static retrieval systems, while outlining future directions in fuse-graph theory and parameter tuning for even tighter performance bounds.
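As a back-of-the-envelope check of the figures quoted above, the following sketch multiplies them out. It assumes that 60 ns/key can be read as elapsed time per key and that the 10.5% overhead applies to the 1-bit filter; neither assumption is stated in this summary, and the function name `build_estimate` is purely illustrative.

```python
def build_estimate(num_keys, ns_per_key, bits_per_key, overhead):
    """Return (total build time in seconds, total size in bytes).

    Illustrative arithmetic only: `overhead` is the relative space
    overhead with respect to the lower bound of `bits_per_key` bits.
    """
    seconds = num_keys * ns_per_key / 10**9
    size_bytes = num_keys * bits_per_key * (1 + overhead) / 8
    return seconds, size_bytes


# Figures from the summary: 10^12 keys, 60 ns/key, 1 bit/key, 10.5% overhead.
seconds, size_bytes = build_estimate(10**12, 60, 1, 0.105)
print(f"build time: {seconds / 3600:.1f} hours")
print(f"size: {size_bytes / 1e9:.0f} GB")
```

Under these assumptions a trillion-key, 1-bit filter would take on the order of 17 hours to build and occupy roughly 138 GB.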

Abstract

We describe a simple yet very scalable implementation of static functions (VFunc) and of static filters (VFilter) based on hypergraphs. We introduce the idea of ε-cost sharding, which allows us to build structures that can manage trillions of keys while at the same time increasing memory locality in hypergraph-based constructions. In contrast to the commonly used HEM sharding method, ε-cost sharding does not require storing additional information and does not introduce dependencies in the computation chain; its only cost is that of a few arithmetic instructions and of a relative increase ε in space usage. We apply ε-cost sharding to the classical MWHC construction, but we obtain the best result by combining Dietzfelbinger and Walzer's fuse graphs for large shards with lazy Gaussian elimination for small shards. We obtain large structures with an overhead of 10.5% with respect to the information-theoretical lower bound and with a query time that is a few nanoseconds away from the query time of the non-sharded version, which is the fastest currently available within the same space bounds. Besides comparing our structures with a non-sharded version, we contrast their tradeoffs with bumped ribbon constructions, a space-saving alternative to hypergraph-based static functions and filters, which provide optimum space consumption but slow construction and query time (though construction can be parallelized very efficiently). We build offline a trillion-key filter using commodity hardware in just 60 ns/key.
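The space cost ε described above can be illustrated with a small experiment. The sketch below is not the paper's construction: it assumes keys are assigned to shards by the high bits of a 64-bit hash, and that every shard is given the capacity of the largest one, so that a shard's position follows from a multiplication rather than from a stored offset table. The names `shard_of`, `observed_epsilon`, and `shard_offset` are hypothetical.

```python
import hashlib
from collections import Counter


def shard_of(key: bytes, shard_bits: int) -> int:
    """Pick a shard from the high bits of a 64-bit hash of the key."""
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return h >> (64 - shard_bits) if shard_bits else 0


def observed_epsilon(keys, shard_bits):
    """Return (epsilon, largest), where epsilon is the relative excess
    of the largest shard over the average shard size."""
    counts = Counter(shard_of(k, shard_bits) for k in keys)
    average = len(keys) / (1 << shard_bits)
    largest = max(counts.values())
    return largest / average - 1.0, largest


def shard_offset(shard: int, capacity: int) -> int:
    """With one uniform capacity per shard, the start of a shard is a
    single multiplication; no per-shard metadata is needed (assumption
    made for this illustration)."""
    return shard * capacity


keys = [str(i).encode() for i in range(100_000)]
eps, largest = observed_epsilon(keys, shard_bits=4)
print(f"16 shards, largest = {largest}, observed epsilon = {eps:.4f}")
print("shard 3 starts at slot", shard_offset(3, largest))
```

Sizing every shard for the largest one wastes a fraction of space equal to the observed ε, which shrinks as shards get larger; the paper's contribution is a principled way to bound that quantity in advance rather than measure it after the fact.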

Paper Structure

This paper contains 18 sections, 23 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Construction and query times for power-of-two bit sizes. BuRR uses the sparse-coefficient variant.
  • Figure 2: Construction and query times for power-of-two-plus-one bit sizes. BuRR uses the interleaved-coefficient variant. Note that the graphs on the right have a very different scale on the vertical axis.
  • Figure 3: Query times under CPU load and memory stress.
  • Figure 4: Construction and query times for index functions.
  • Figure 5: Construction time in the sequential case for sharded (VFilter) and non-sharded fuse graphs. The spike at the end arises because low-memory visits had to be forced in order to build a structure with $2^{32}$ keys within the available memory in the unsharded case; building a structure with $2^{33}$ keys failed due to insufficient memory.