Computing Data Distribution from Query Selectivities
Pankaj K. Agarwal, Rahul Raychaudhury, Stavros Sintos, Jun Yang
TL;DR
This work studies how to recover a compact discrete data distribution that best explains a workload of range queries with observed selectivities. It proves that finding an optimal small distribution is NP-hard, and then gives a Monte Carlo algorithm that constructs a distribution of size $O(\delta^{-2})$ whose error is within an additive $\delta$ of the best possible (for $p\in\{1,2,\infty\}$), in time near-linear in the number of training samples. An improved implicit-LP approach uses geometric deepest-point computations and $\varepsilon$-approximations to attain running time near-linear in $n$ for fixed dimension, with extensions to other error norms and range families. The paper also establishes conditional lower bounds indicating that relative approximations are infeasible and that the exponential dependence on the dimension cannot be removed for additive approximations, and it surveys related work in selectivity estimation and optimization based on multiplicative weights update (MWU). Overall, the results offer provable guarantees for compact, query-driven representations of data distributions, with implications for efficient data modeling in database systems.
Abstract
We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots, (R_n,s_n)\}$, where each $R_i$ is a \emph{range} in $\Re^d$, such as a rectangle or a ball, and $s_i \in [0,1]$ denotes its \emph{selectivity}. The goal is to compute a small-size \emph{discrete data distribution} $\mathcal{D}=\{(q_1,w_1),\ldots, (q_m,w_m)\}$, where $q_j\in \Re^d$ and $w_j\in [0,1]$ for each $1\leq j\leq m$, and $\sum_{1\leq j\leq m}w_j= 1$, such that $\mathcal{D}$ is most \emph{consistent} with $\mathcal{Z}$, i.e., $\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^n\! \lvert{s_i-\sum_{j=1}^m w_j\cdot 1(q_j\in R_i)}\rvert^p$ is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is $\mathsf{NP}$-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time $O((n+\delta^{-d})\delta^{-2}\mathop{\mathrm{polylog}})$, a discrete distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that $\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\leq \min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$ (for $p=1,2,\infty$), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as of removing the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
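
To make the objective concrete, the following minimal sketch evaluates $\mathrm{err}_p(\mathcal{D},\mathcal{Z})$ for a given discrete distribution and a workload of axis-aligned rectangles. It only scores a candidate distribution; it is not the paper's algorithm for constructing one. The names `contains`, `estimated_selectivity`, and `err_p`, the closed-rectangle representation, and the treatment of $p=\infty$ as the maximum absolute deviation are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed

def contains(rect: Rect, q: Point) -> bool:
    """Return True if point q lies in the closed axis-aligned rectangle."""
    lo, hi = rect
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))

def estimated_selectivity(dist: Sequence[Tuple[Point, float]], rect: Rect) -> float:
    """Total weight of the distribution's points that fall inside rect."""
    return sum(w for q, w in dist if contains(rect, q))

def err_p(dist: Sequence[Tuple[Point, float]],
          workload: Sequence[Tuple[Rect, float]],
          p: float) -> float:
    """Average of |s_i - estimate_i|^p over the workload, as in the abstract.

    Assumption: for p = math.inf this returns the maximum absolute
    deviation, since the averaged formula is only meaningful for finite p.
    """
    deviations = [abs(s - estimated_selectivity(dist, r)) for r, s in workload]
    if math.isinf(p):
        return max(deviations)
    return sum(d ** p for d in deviations) / len(deviations)

# Hypothetical toy instance in the plane: two queries, two weighted points.
workload = [
    (((0.0, 0.0), (1.0, 1.0)), 0.5),
    (((0.5, 0.5), (2.0, 2.0)), 0.75),
]
dist = [((0.25, 0.25), 0.5), ((1.5, 1.5), 0.5)]

print(err_p(dist, workload, 1))
print(err_p(dist, workload, 2))
print(err_p(dist, workload, math.inf))
```

In this toy instance the first query is matched exactly and the second is off by 0.25, so the three error measures differ only in how they aggregate that single deviation.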
