Grafite: Taming Adversarial Queries with Optimal Range Filters

Marco Costa, Paolo Ferragina, Giorgio Vinciguerra

TL;DR

Grafite is the only range filter to date to achieve robust and predictable false positive rates across all combinations of datasets, query workloads, and range sizes, while providing faster queries and construction times, and dominating all competitors in the case of correlated queries.

Abstract

Range filters allow checking whether a query range intersects a given set of keys with a chance of returning a false positive answer, thus generalising the functionality of Bloom filters from point to range queries. Existing practical range filters have addressed this problem heuristically, resulting in high false positive rates and query times when dealing with adversarial inputs, such as in the common scenario where queries are correlated with the keys. We introduce Grafite, a novel range filter that solves these issues with a simple design and clear theoretical guarantees that hold regardless of the input data and query distribution: given a fixed space budget of $B$ bits per key, the query time is $O(1)$, and the false positive probability is upper bounded by $\ell/2^{B-2}$, where $\ell$ is the query range size. Our experimental evaluation shows that Grafite is the only range filter to date to achieve robust and predictable false positive rates across all combinations of datasets, query workloads, and range sizes, while providing faster queries and construction times, and dominating all competitors in the case of correlated queries. As a further contribution, we introduce a very simple heuristic range filter whose performance on uncorrelated queries is very close to or better than the one achieved by the best heuristic range filters proposed in the literature so far.
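The guarantee stated above is easy to evaluate numerically. The following minimal sketch computes the bound $\ell/2^{B-2}$ for a given space budget and range size, and the budget at which the bound equals a target probability. The function names are illustrative and not taken from the paper; capping the bound at 1 is an added convenience.

```python
import math


def grafite_fpp_bound(bits_per_key: float, range_size: int) -> float:
    """Upper bound ell / 2^(B-2) on the false positive probability.

    The value is capped at 1.0, since a probability cannot exceed 1
    (the cap is an assumption of this sketch, not part of the stated bound).
    """
    if range_size < 1:
        raise ValueError("range_size must be at least 1")
    return min(1.0, range_size / 2 ** (bits_per_key - 2))


def bits_for_target(epsilon: float, range_size: int) -> float:
    """Bits per key B at which ell / 2^(B-2) equals epsilon."""
    if not 0 < epsilon <= 1:
        raise ValueError("epsilon must be in (0, 1]")
    return math.log2(range_size / epsilon) + 2


if __name__ == "__main__":
    for ell in (1, 32, 1024):
        print(ell, grafite_fpp_bound(12, ell))
```

For instance, with $B = 12$ bits per key and ranges of size $\ell = 32$, the bound is $32/2^{10} \approx 3.1\%$, independently of how the keys and queries are distributed.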

Paper Structure (14 sections, 4 theorems, 5 equations, 7 figures, 1 table, 2 algorithms)

Key Result

theorem 1

Any data structure solving approximate range emptiness queries of fixed length $L \leq u/(5n)$ on $n$ keys drawn from an integer universe $[u] = \{0, \ldots, u - 1 \}$ with a false positive probability of $\varepsilon$ must use at least $n \log(\tfrac{L^{1-O(\varepsilon)}}{\varepsilon}) - O(n)$ bits.
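As a back-of-the-envelope comparison (not a statement taken from the paper), the upper bound quoted in the abstract can be inverted and set against this lower bound. The sketch assumes base-2 logarithms and identifies the query range size $\ell$ with the fixed length $L$. Requiring the abstract's bound to be at most $\varepsilon$ gives

$$
\frac{\ell}{2^{B-2}} \le \varepsilon
\quad\Longleftrightarrow\quad
B \ge \log_2 \frac{\ell}{\varepsilon} + 2 ,
$$

so a budget of about $n \log_2(\ell/\varepsilon) + 2n$ bits in total suffices. Under these assumptions, this differs from the lower bound $n \log(\tfrac{L^{1-O(\varepsilon)}}{\varepsilon}) - O(n)$ only by the $L^{O(\varepsilon)}$ factor inside the logarithm and by an additive $O(n)$ term.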

Figures (7)

  • Figure 1: Grafite is the only range filter to date that is both effective (low false positive rate) and efficient (low query time) as the endpoints of the query range get closer to the data.
  • Figure 2: An example of Grafite storing the compressed hash codes $6, 14, 32, 51, 53, 55, 66, 70, 91, 94$ (see Example \ref{ex:encoding}), and some steps needed for answering a range emptiness query (see Example \ref{ex:query}).
  • Figure 3: The majority of range filters provide no filtering (Bucketing, SNARF, SuRF, REncoderSS) or much degraded filtering and query performance (Proteus, REncoderSE) as the key-query correlation increases. An adversary could exploit this weakness to attack the availability of a system that employs these heuristic range filters. Instead, Grafite and Rosetta are robust range filters, while REncoder is robust for large range queries. Grafite offers significantly better query time and FPR than Rosetta and REncoder.
  • Figure 4: Comparison among heuristic range filters. In the first row, only Proteus and REncoderSE provide some range query filtering (albeit unsatisfactorily, as discussed in Section \ref{ssec:exp-robustness}) because they are auto-tuned on the correlated query workload. In the other rows, a simple solution like Bucketing provides very close or better FPR, and much better query time than all the other heuristic range filters. We remark that, unlike the other range filters, SNARF suffers from false negatives (see Footnote \ref{foot:snarf_bug}).
  • Figure 5: Grafite dominates all other robust range filters by providing up to 5 orders of magnitude better FPR and up to 92$\times$ faster queries. These substantial improvements, coupled with its performance guarantees (Corollary \ref{cor:grafite}), make Grafite the range filter of choice in applications handling a variety of data distributions and query workloads, even adversarial ones.
  • ...and 2 more figures
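The captions of Figures 3 and 4 name a simple heuristic called Bucketing but do not describe it here. The sketch below shows one plausible bucketing-style range filter, written only to illustrate why such heuristics lose their filtering power on correlated queries. It assumes fixed-size buckets over non-negative integer keys and a plain sorted list of non-empty bucket identifiers; the class name, its parameters, and the storage layout are hypothetical, and the paper's actual design may differ.

```python
from bisect import bisect_left


class BucketingFilter:
    """Illustrative bucketing-style range filter (hypothetical design)."""

    def __init__(self, keys, bucket_size):
        if bucket_size < 1:
            raise ValueError("bucket_size must be at least 1")
        self.bucket_size = bucket_size
        # Keep only the identifiers of the buckets containing at least one key.
        self.buckets = sorted({k // bucket_size for k in keys})

    def may_intersect(self, lo, hi):
        """Return False only if no key can lie in the inclusive range [lo, hi]."""
        first = lo // self.bucket_size
        last = hi // self.bucket_size
        # Find the first non-empty bucket whose identifier is >= first.
        i = bisect_left(self.buckets, first)
        return i < len(self.buckets) and self.buckets[i] <= last


if __name__ == "__main__":
    f = BucketingFilter([100, 205, 990], bucket_size=64)
    print(f.may_intersect(300, 400))  # far from every key
    print(f.may_intersect(101, 110))  # empty range right next to key 100
```

In this sketch, a query far from every key is filtered out, whereas an empty range placed right next to a key falls in that key's bucket and is reported as possibly non-empty. Under this reading, a workload whose endpoints are correlated with the keys would therefore see a false positive rate close to 1, whatever the bucket size.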

Theorems & Definitions (4)

  • theorem 1: goswamiApproximateRangeEmptiness2014
  • lemma 1: goswamiApproximateRangeEmptiness2014
  • theorem 2
  • corollary 1