Optimal quantile estimation: beyond the comparison model
Meghal Gupta, Mihir Singhal, Hongxun Wu
TL;DR
The paper addresses deterministic, space-efficient quantile estimation over data streams whose elements come from a finite universe of size $U$. It introduces a recursive quantile sketch, built on an eager variant of q-digest, that layers multiple structures and batches insertions to achieve $O\left(\varepsilon^{-1}\right)$ words of space. This improves over prior non-comparison-based approaches and is optimal in the natural regime $n \leq \mathrm{poly}(U)$. The core idea is to maintain the non-full nodes via inner quantile sketches on a reduced universe (the exposed nodes), and to apply the same construction recursively so that the effective universe shrinks by roughly a logarithmic factor at each layer. This yields a total space of $O\left(\varepsilon^{-1}(\log(\varepsilon n) + \log(\varepsilon U))\right)$ bits with near-optimal time complexity. The work also discusses practical considerations, such as partial mergeability, constant-factor optimizations, and how to support select queries with real elements, which makes the approach relevant to systems and libraries that need tight space bounds for quantile queries.
Abstract
Estimating quantiles is one of the foundational problems of data sketching. Given $n$ elements $x_1, x_2, \dots, x_n$ from some universe of size $U$ arriving in a data stream, a quantile sketch estimates the rank of any element with additive error at most $\varepsilon n$. A low-space algorithm solving this task has applications in database systems, network measurement, load balancing, and many other practical scenarios. Current quantile estimation algorithms described as optimal include the GK sketch (Greenwald and Khanna 2001) using $O(\varepsilon^{-1} \log n)$ words (deterministic) and the KLL sketch (Karnin, Lang, and Liberty 2016) using $O(\varepsilon^{-1} \log\log(1/\delta))$ words (randomized, with failure probability $\delta$). However, both algorithms are only optimal in the comparison-based model, whereas most typical applications involve streams of integers that the sketch can use aside from making comparisons. If we go beyond the comparison-based model, the deterministic q-digest sketch (Shrivastava, Buragohain, Agrawal, and Suri 2004) achieves a space complexity of $O(\varepsilon^{-1}\log U)$ words, which is incomparable to the previously-mentioned sketches. It has long been asked whether there is a quantile sketch using $O(\varepsilon^{-1})$ words of space (which is optimal as long as $n \leq \mathrm{poly}(U)$). In this work, we present a deterministic algorithm using $O(\varepsilon^{-1})$ words, resolving this line of work.
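
To make the rank guarantee concrete, the following is a minimal Python sketch of the classic q-digest idea that the abstract cites as the non-comparison-based baseline. It is not the paper's new $O(\varepsilon^{-1})$-word algorithm. The class name `QDigest`, the parameters `universe_bits` and `eps`, and the choice to compress every `k` insertions are illustrative assumptions, and the code only aims to show why a tree over the integer universe gives additive rank error at most $\varepsilon n$.

```python
import math


class QDigest:
    """Simplified q-digest over the integer universe [0, 2**universe_bits).

    Hypothetical illustration only; assumes universe_bits >= 1 and 0 < eps <= 1.
    """

    def __init__(self, universe_bits: int, eps: float):
        self.bits = universe_bits
        self.leaf_base = 1 << universe_bits  # heap id of the leaf for value 0
        # Choosing k >= log2(U) / eps keeps the rank error at most eps * n.
        self.k = max(1, math.ceil(universe_bits / eps))
        self.n = 0
        self.counts = {}  # heap-numbered tree node id -> number of stored elements

    def insert(self, x: int) -> None:
        if not 0 <= x < self.leaf_base:
            raise ValueError("element outside the universe")
        leaf = self.leaf_base + x
        self.counts[leaf] = self.counts.get(leaf, 0) + 1
        self.n += 1
        if self.n % self.k == 0:  # assumed schedule: compress every k insertions
            self._compress()

    def _compress(self) -> None:
        """Merge light sibling pairs into their parent, deepest level first."""
        threshold = self.n // self.k
        for level in range(self.bits, 0, -1):  # the root (level 0) has no parent
            lo, hi = 1 << level, 1 << (level + 1)
            for v in [u for u in self.counts if lo <= u < hi]:
                if v not in self.counts:
                    continue  # already merged as the sibling of an earlier node
                parent, sibling = v >> 1, v ^ 1
                total = (self.counts[v]
                         + self.counts.get(sibling, 0)
                         + self.counts.get(parent, 0))
                if total <= threshold:
                    # Internal nodes therefore never hold more than n // k elements.
                    self.counts[parent] = total
                    del self.counts[v]
                    self.counts.pop(sibling, None)

    def rank(self, x: int) -> int:
        """Lower-bound estimate of the number of inserted elements <= x."""
        est = 0
        for v, c in self.counts.items():
            level = v.bit_length() - 1
            span_bits = self.bits - level
            largest = ((v - (1 << level) + 1) << span_bits) - 1  # max value under v
            if largest <= x:
                est += c
        return est
```

The estimate can only miss elements stored at proper ancestors of the leaf for `x`. There are $\log_2 U$ such ancestors, each holding at most $\lfloor n/k \rfloor$ elements, so with $k \geq \log_2(U)/\varepsilon$ the undercount is at most $\varepsilon n$. Keeping on the order of $k$ nodes is what leads to the $O(\varepsilon^{-1}\log U)$-word bound quoted above for q-digest; this simplified variant does not attempt to prove that size bound.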
