A Tight Lower Bound for Comparison-Based Quantile Summaries
Graham Cormode, Pavel Veselý
TL;DR
This work proves a tight Ω((1/ε) · log(εN)) space lower bound for deterministic, comparison-based ε-approximate quantile summaries in one-pass streaming. The bound matches the Greenwald-Khanna (GK) upper bound and rules out any space bound of the form f(ε) · o(log N), where f does not depend on N. The authors introduce indistinguishable streams and a recursive adversarial construction that refines intervals to maximize the rank gap while preserving indistinguishability, culminating in a space–gap inequality that links memory usage to the largest gap. The results extend to biased (relative-error) quantiles and yield corollaries for approximate median computation and rank estimation, and they provide a framework for understanding the limits of both deterministic and randomized comparison-based streaming quantile methods. Overall, the paper settles the asymptotic space complexity of this problem in the deterministic, comparison-based model and clarifies the tradeoffs for related tasks in streaming quantile computation.
Abstract
Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles, up to an error of at most $\varepsilon$. That is, an $\varepsilon$-approximate quantile summary first processes a stream of items and then, given any quantile query $0\le \phi\le 1$, returns an item from the stream, which is a $\phi'$-quantile for some $\phi' = \phi\pm \varepsilon$. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna (SIGMOD '01), stores at most $O(\frac{1}{\varepsilon}\cdot \log (\varepsilon N))$ items, where $N$ is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space $f(\varepsilon)\cdot o(\log N)$, for any function $f$ that does not depend on $N$. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of $(1\pm \varepsilon)\cdot \phi$, and for other related computational tasks.
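
To make the $\varepsilon$-approximate guarantee concrete, the following minimal Python sketch checks whether a returned item is an acceptable answer to a quantile query. It is an illustration only, not part of the paper: the function names `rank_interval` and `is_valid_answer` are hypothetical, and it assumes the convention that the $\phi$-quantile of $N$ items is the item of rank $\phi N$ in sorted order, with an item that occurs several times occupying a range of ranks. The check touches items only through comparisons and equality tests, in the spirit of the comparison-based model.

```python
def rank_interval(stream, item):
    """Return the (lowest, highest) 1-indexed rank `item` can occupy in sorted order.

    Only comparisons between items are used; duplicates widen the interval.
    """
    smaller = sum(1 for x in stream if x < item)
    not_larger = sum(1 for x in stream if x <= item)
    return smaller + 1, not_larger


def is_valid_answer(stream, phi, eps, answer):
    """Check whether `answer` is a phi'-quantile for some phi' in [phi - eps, phi + eps].

    Assumption (not from the paper): the phi-quantile is the item of rank phi * N.
    """
    n = len(stream)
    if answer not in stream:  # the summary must return an item from the stream
        return False
    lo, hi = rank_interval(stream, answer)
    # The answer's rank interval must intersect [(phi - eps) * N, (phi + eps) * N].
    return hi >= (phi - eps) * n and lo <= (phi + eps) * n


stream = list(range(1, 101))  # 100 distinct items; item i has rank i
print(is_valid_answer(stream, 0.5, 0.1, 45))  # True: rank 45 lies in [40, 60]
print(is_valid_answer(stream, 0.5, 0.1, 70))  # False: rank 70 lies outside [40, 60]
```

This checker keeps the whole stream in memory and therefore says nothing about space; the question settled by the lower bound is how few items a summary can store while still guaranteeing a valid answer for every query $\phi$.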
