
Optimal quantile estimation: beyond the comparison model

Meghal Gupta, Mihir Singhal, Hongxun Wu

TL;DR

The paper addresses deterministic, space-efficient quantile estimation in the streaming setting, where elements come from a finite universe of size $U$. It introduces a recursive quantile sketch based on an eager variant of q-digest that layers multiple structures and batches insertions to achieve $O\left(\varepsilon^{-1}\right)$ words of space. This improves on the $O\left(\varepsilon^{-1}\log U\right)$ words of q-digest and is optimal as long as $n \leq \mathrm{poly}(U)$. The core idea is to maintain the non-full nodes via inner quantile sketches on a reduced universe (the exposed nodes) and to apply this idea recursively, shrinking the effective universe by roughly a logarithmic factor at each layer. This yields a total space of $O\left(\varepsilon^{-1}(\log(\varepsilon n) + \log(\varepsilon U))\right)$ bits with near-optimal time complexity. The work also discusses practical considerations, such as partial mergeability, constant-factor optimizations, and how to support select queries with real elements, making the approach applicable in real systems and libraries that require tight space bounds for quantile queries.
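As a rough illustration of the single-layer building block, here is a minimal Python sketch of a q-digest-style tree with "eager" insertion. It is an assumption-laden reading of the summary above, not the paper's algorithm: the class name `EagerQDigestSketch`, the rule "place each element in the highest non-full node on its root-to-leaf path", the per-node capacity $\lfloor \varepsilon n / \log U \rfloor$, and the requirement that $n$ is known in advance are all choices made for this example. It does not implement the recursive layering or batching, so it only reaches the $O(\varepsilon^{-1}\log U)$-word regime of q-digest.

```python
import bisect
import random


class EagerQDigestSketch:
    """Single-layer, q-digest-style sketch over the universe [0, 2**log_u).

    Illustrative assumptions (not taken from the paper):
      * the stream length n is known up front,
      * every internal node holds at most cap = floor(eps * n / log_u) elements,
      * leaves are unbounded, since all their elements are equal.
    """

    def __init__(self, log_u, eps, n):
        self.log_u = log_u
        self.cap = max(1, int(eps * n) // log_u)
        self.count = {}  # (depth, index) -> number of elements stored there

    def insert(self, x):
        # Walk from the root towards leaf x; stop at the first non-full node.
        for depth in range(self.log_u + 1):
            node = (depth, x >> (self.log_u - depth))
            is_leaf = depth == self.log_u
            if is_leaf or self.count.get(node, 0) < self.cap:
                self.count[node] = self.count.get(node, 0) + 1
                return

    def rank(self, x):
        """Lower estimate of the number of inserted elements <= x."""
        total = 0
        for (depth, idx), c in self.count.items():
            # Largest universe value covered by this node's interval.
            hi = ((idx + 1) << (self.log_u - depth)) - 1
            if hi <= x:
                total += c  # every element in this node is certainly <= x
        return total


# Usage: compare the estimate against the exact rank on a random stream.
random.seed(0)
LOG_U, N, EPS = 16, 10_000, 0.1
stream = [random.randrange(1 << LOG_U) for _ in range(N)]
sketch = EagerQDigestSketch(LOG_U, EPS, N)
for value in stream:
    sketch.insert(value)

sorted_stream = sorted(stream)
query = 1 << (LOG_U - 1)
exact = bisect.bisect_right(sorted_stream, query)
print("exact rank:", exact, "estimate:", sketch.rank(query),
      "non-empty nodes:", len(sketch.count))
```

Under these assumptions, the only nodes that can hide elements `<= x` from the estimate are the at most $\log U$ internal ancestors of leaf `x`, each holding at most `cap` elements, so the estimate undercounts by at most $\varepsilon n$. A node becomes non-empty only if it is the root or its parent is full, which bounds the number of stored counters by roughly $2\log U/\varepsilon$.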

Abstract

Estimating quantiles is one of the foundational problems of data sketching. Given $n$ elements $x_1, x_2, \dots, x_n$ from some universe of size $U$ arriving in a data stream, a quantile sketch estimates the rank of any element with additive error at most $\varepsilon n$. A low-space algorithm solving this task has applications in database systems, network measurement, load balancing, and many other practical scenarios. Current quantile estimation algorithms described as optimal include the GK sketch (Greenwald and Khanna 2001) using $O(\varepsilon^{-1} \log n)$ words (deterministic) and the KLL sketch (Karnin, Lang, and Liberty 2016) using $O(\varepsilon^{-1} \log\log(1/\delta))$ words (randomized, with failure probability $\delta$). However, both algorithms are only optimal in the comparison-based model, whereas most typical applications involve streams of integers that the sketch can use aside from making comparisons. If we go beyond the comparison-based model, the deterministic q-digest sketch (Shrivastava, Buragohain, Agrawal, and Suri 2004) achieves a space complexity of $O(\varepsilon^{-1}\log U)$ words, which is incomparable to the previously-mentioned sketches. It has long been asked whether there is a quantile sketch using $O(\varepsilon^{-1})$ words of space (which is optimal as long as $n \leq \mathrm{poly}(U)$). In this work, we present a deterministic algorithm using $O(\varepsilon^{-1})$ words, resolving this line of work.
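To make the rank guarantee concrete, the following Python sketch checks the additive-error condition against an exact rank oracle. It also builds a simple offline summary that keeps every $\lfloor \varepsilon n \rfloor$-th element of the sorted input, which shows why about $\varepsilon^{-1}$ stored elements is the natural space target. The helper names (`exact_rank`, `max_rank_error`, `build_offline_summary`) and the convention that the rank of $x$ counts elements $\leq x$ are assumptions for this example. The summary sees the whole sorted input at once, so it is not a streaming algorithm and not the method of the paper.

```python
import bisect


def exact_rank(sorted_stream, x):
    """Number of stream elements <= x (one common rank convention)."""
    return bisect.bisect_right(sorted_stream, x)


def max_rank_error(stream, estimate_rank, queries):
    """Largest absolute rank error of estimate_rank over the given queries."""
    s = sorted(stream)
    return max(abs(exact_rank(s, q) - estimate_rank(q)) for q in queries)


def build_offline_summary(stream, eps):
    """Offline baseline: keep the elements at sorted positions step, 2*step, ..."""
    s = sorted(stream)
    step = max(1, int(eps * len(s)))
    kept = s[step - 1::step]

    def estimate_rank(x):
        # Each kept element <= x certifies `step` stream elements <= x.
        return step * bisect.bisect_right(kept, x)

    return kept, estimate_rank


# Usage on a deterministic permutation of 0..999.
stream = [(i * 7919) % 1000 for i in range(1000)]
eps = 0.1
kept, estimate_rank = build_offline_summary(stream, eps)
worst = max_rank_error(stream, estimate_rank, range(-5, 1005))
print("stored elements:", len(kept), "worst rank error:", worst)
```

The difficulty the abstract refers to is achieving the same order of space deterministically when elements arrive one at a time and cannot be revisited.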

Paper Structure

This paper contains 40 sections, 11 theorems, 40 equations, 4 figures, 2 tables, 7 algorithms.

Key Result

Theorem 1.2

There exists a deterministic streaming algorithm for the all-quantile sketch problem using $O(\varepsilon^{-1})$ words (more specifically, $O(\varepsilon^{-1} (\log (\varepsilon n) + \log(\varepsilon U)))$ bits) of space.
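The step from the bit bound to the word bound can be spelled out as follows. This is a short derivation under two assumptions that the statement above does not make explicit: a machine word has $w = \Theta(\log n + \log U)$ bits (enough to store one element or one counter), and $\varepsilon \leq 1$.

```latex
\begin{align*}
  \log(\varepsilon n) \le \log n,
  \qquad
  \log(\varepsilon U) \le \log U
  \qquad &\text{(since } \varepsilon \le 1\text{)} \\[4pt]
  \frac{\varepsilon^{-1}\bigl(\log(\varepsilon n) + \log(\varepsilon U)\bigr)}{w}
  \;\le\;
  \frac{\varepsilon^{-1}\,(\log n + \log U)}{w}
  \;=\; O\!\left(\varepsilon^{-1}\right)
  \qquad &\text{(since } w = \Theta(\log n + \log U)\text{)}
\end{align*}
```

So a sketch of $O(\varepsilon^{-1}(\log(\varepsilon n) + \log(\varepsilon U)))$ bits fits in $O(\varepsilon^{-1})$ words, and the bit bound is the finer of the two statements.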

Figures (4)

  • Figure 1: An example eager q-digest tree.
  • Figure 2: The tree formed by non-empty nodes in eager q-digest. (The filled nodes are the full nodes.)
  • Figure 3: An inner eager q-digest tree whose universe is the exposed nodes of the original tree. (The filled nodes are the full nodes.)
  • Figure 4: The structure of different layers. Here $\varepsilon = 0.5$, so there are $1/\varepsilon = 2$ trees in each layer. The nodes below the base level of each layer are marked in gray. Note that when we construct $T_i$, we take all the exposed nodes in $T_{i - 1}$ and use them as the base-level nodes to build $1/\varepsilon$ trees. Then we copy their subtrees in $T_{i - 1}$ to be their subtrees in $T_i$.

Theorems & Definitions (52)

  • Theorem 1.2
  • Conjecture 1.3: Deterministic parallel counters
  • Remark 5.1
  • Remark 5.2
  • Remark 5.3
  • Remark 5.4
  • Definition 5.8: Consistency
  • Definition 5.9: Discrepancy
  • Lemma 5.10: Step 1
  • proof
  • ...and 42 more