
Computing Data Distribution from Query Selectivities

Pankaj K. Agarwal, Rahul Raychaudhury, Stavros Sintos, Jun Yang

TL;DR

This work studies how to recover a compact discrete data distribution that best explains a workload of range queries with observed selectivities. It proves NP-hardness for finding an optimal small distribution and then provides a Monte Carlo algorithm that constructs a distribution of size $O(\delta^{-2})$ achieving additive error $\delta$ (for $p\in\{1,2,\infty\}$) relative to the best possible, and does so in time near-linear in the number of training samples. An improved implicit-LP approach leverages geometric deepest-point computations and $\varepsilon$-approximations to attain near-linear running time in $n$ for fixed dimension, with extensions to other error norms and range families. The paper also establishes conditional lower bounds that suggest substantial improvements in dimension or approximation quality are unlikely, and it surveys related work in selectivity estimation and MWU-based optimization. Overall, the results offer provable guarantees for compact, query-driven representations of data distributions, with implications for efficient data modeling in database systems.

Abstract

We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots, (R_n,s_n)\}$, where each $R_i$ is a \emph{range} in $\Re^d$, such as a rectangle or a ball, and $s_i \in [0,1]$ denotes its \emph{selectivity}. The goal is to compute a small-size \emph{discrete data distribution} $\mathcal{D}=\{(q_1,w_1),\ldots, (q_m,w_m)\}$, where $q_j\in \Re^d$ and $w_j\in [0,1]$ for each $1\leq j\leq m$, and $\sum_{1\leq j\leq m}w_j= 1$, such that $\mathcal{D}$ is the most \emph{consistent} with $\mathcal{Z}$, i.e., $\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^n\! \lvert{s_i-\sum_{j=1}^m w_j\cdot 1(q_j\in R_i)}\rvert^p$ is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is $\mathsf{NP}$-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time $O((n+\delta^{-d})\delta^{-2}\mathop{\mathrm{polylog}})$, a discrete distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that $\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\leq \min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$ (for $p=1,2,\infty$), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as of removing the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
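
The error measure in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the paper's algorithm: it evaluates $\mathrm{err}_p(\mathcal{D},\mathcal{Z})$ for axis-aligned rectangles. The names `contains` and `err_p` are hypothetical, ranges are assumed to be closed boxes given by their lower and upper corners, and the case $p=\infty$ is assumed to mean the maximum absolute residual, which the abstract does not spell out.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed


def contains(box: Box, q: Point) -> bool:
    """True if q lies inside the closed axis-aligned box."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def err_p(dist: List[Tuple[Point, float]],
          workload: List[Tuple[Box, float]],
          p: float = 1) -> float:
    """Mean p-th power of |s_i - (mass of dist inside R_i)|.

    For p = inf this returns the maximum absolute residual
    (an assumption about how the paper treats that case).
    """
    residuals = [
        abs(s - sum(w for q, w in dist if contains(R, q)))
        for R, s in workload
    ]
    if p == float("inf"):
        return max(residuals)
    return sum(r ** p for r in residuals) / len(residuals)


# Toy workload Z: two rectangles in the plane with observed selectivities.
workload = [
    (((0.0, 0.0), (1.0, 1.0)), 0.5),
    (((0.5, 0.5), (2.0, 2.0)), 0.75),
]
# Candidate distribution D: two weighted points whose weights sum to 1.
dist = [((0.25, 0.25), 0.5), ((1.5, 1.5), 0.5)]

e1 = err_p(dist, workload, p=1)
e2 = err_p(dist, workload, p=2)
e_inf = err_p(dist, workload, p=float("inf"))
```

In this toy instance the first query is matched exactly and the second is underestimated by 0.25, so the three error values differ only in how they aggregate that single residual.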

Paper Structure (20 sections, 17 theorems, 19 equations, 5 figures)

Key Result

Lemma 1

For any discrete distribution $\mathcal{D}\in \mathbb{D}$, there is another discrete distribution $\mathcal{D}'$ such that $\mathrm{supp}(\mathcal{D}')\subseteq \mathcal{P}$ and $\mathrm{err}(\mathcal{D},\mathcal{Z})=\mathrm{err}(\mathcal{D}',\mathcal{Z})$.
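
The idea behind Lemma 1 can be sketched in a few lines. The code below is a hedged illustration under stated assumptions, not the paper's construction: it assumes, based on the Figure 1 caption, that $\mathcal{P}$ holds one representative point per cell of the arrangement of the ranges. Instead of building the arrangement, it groups points by their containment signature (the set of ranges that contain them), which is all that the error depends on, and it reuses the first point seen in each group as the representative. The names `contains`, `mass`, and `collapse` are hypothetical, and ranges are assumed to be closed axis-aligned boxes.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed


def contains(box: Box, q: Point) -> bool:
    """True if q lies inside the closed axis-aligned box."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def mass(dist: List[Tuple[Point, float]], box: Box) -> float:
    """Total weight of the distribution inside the box."""
    return sum(w for q, w in dist if contains(box, q))


def collapse(dist: List[Tuple[Point, float]],
             ranges: List[Box]) -> List[Tuple[Point, float]]:
    """Merge points that lie in exactly the same set of ranges.

    Each group keeps its first point as representative and receives the
    summed weight, so the mass inside every range is unchanged.
    """
    groups: Dict[Tuple[bool, ...], Tuple[Point, float]] = {}
    for q, w in dist:
        sig = tuple(contains(R, q) for R in ranges)
        if sig in groups:
            rep, total = groups[sig]
            groups[sig] = (rep, total + w)
        else:
            groups[sig] = (q, w)
    return list(groups.values())


ranges = [((0.0, 0.0), (1.0, 1.0)), ((0.5, 0.5), (2.0, 2.0))]
dist = [((0.1, 0.1), 0.25), ((0.2, 0.3), 0.25), ((1.5, 1.5), 0.5)]
merged = collapse(dist, ranges)
```

Because every range sees the same total weight before and after merging, the error against any workload over these ranges is identical for `dist` and `merged`, mirroring the weight sums listed in the Figure 1 caption.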

Figures (5)

  • Figure 1: The (black) points $q_i$ are points from the underlying data distribution $\mathcal{D}$, while the (blue) points $p_{\tau}$ belong to $\mathcal{P}$. The (green) dashed segments show three cells of the arrangement in rectangle $R$. The weights of points in $\mathcal{P}$ are: $w(p_1)=w_1+w_3$, $w(p_2)=w_2$, $w(p_3)=w_4+w_5$, $w(p_4)=w_6+w_8$, $w(p_5)=w_7$.
  • Figure 2: Schematic with two clauses and four variables
  • Figure 3: Variable Chain
  • Figure 4: Junction of variables $x_1, x_2$.
  • Figure 5: Clause $C_5=(\neg x_1 \lor x_2 \lor \neg x_3)$. The gray areas are the intersections $R_{16}^1\cap R_{17}^1$, $R_{21}^2\cap R_{22}^2$.

Theorems & Definitions (17)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Theorem 9
  • Theorem 10
  • ...and 7 more