Computing Data Distribution from Query Selectivities

Pankaj K. Agarwal; Rahul Raychaudhury; Stavros Sintos; Jun Yang

TL;DR

This work studies how to recover a compact discrete data distribution that best explains a workload of range queries with observed selectivities. It proves NP-hardness for finding an optimal small distribution and then provides a Monte Carlo algorithm that constructs a distribution of size $O(\delta^{-2})$ achieving additive error $\delta$ (for $p\in\{1,2,\infty\}$) relative to the best possible, and does so in time near-linear in the number of training samples. An improved implicit-LP approach leverages geometric deepest-point computations and $\varepsilon$-approximations to attain near-linear running time in $n$ for fixed dimension, with extensions to other error norms and range families. The paper also establishes conditional lower bounds that suggest substantial improvements in dimension or approximation quality are unlikely, and it surveys related work in selectivity estimation and MWU-based optimization. Overall, the results offer provable guarantees for compact, query-driven representations of data distributions, with implications for efficient data modeling in database systems.

Abstract

We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots, (R_n,s_n)\}$, where each $R_i$ is a \emph{range} in $\Re^d$, such as a rectangle or a ball, and $s_i \in [0,1]$ denotes its \emph{selectivity}. The goal is to compute a small-size \emph{discrete data distribution} $\mathcal{D}=\{(q_1,w_1),\ldots, (q_m,w_m)\}$, where $q_j\in \Re^d$ and $w_j\in [0,1]$ for each $1\leq j\leq m$, and $\sum_{1\leq j\leq m}w_j= 1$, such that $\mathcal{D}$ is the most \emph{consistent} with $\mathcal{Z}$, i.e., $\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^n \lvert s_i-\sum_{j=1}^m w_j\cdot 1(q_j\in R_i)\rvert^p$ is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is $\mathsf{NP}$-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time $O((n+\delta^{-d})\delta^{-2}\mathop{\mathrm{polylog}})$, a discrete distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that $\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\leq \min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$ (for $p=1,2,\infty$), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate that relative approximations are infeasible and that the exponential dependency on the dimension cannot be removed for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
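To make the objective concrete, the following is a minimal sketch of evaluating $\mathrm{err}_p(\mathcal{D},\mathcal{Z})$ for axis-aligned rectangles. It is an illustration of the definition only, not the paper's algorithm. The names `contains`, `estimated_selectivity`, and `err_p` are hypothetical, and treating $p=\infty$ as the maximum residual is an assumption (the usual convention), since the formula above is stated only for finite $p$.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]                  # axis-aligned box: (lower corner, upper corner)
Distribution = List[Tuple[Point, float]]   # pairs (q_j, w_j); weights are expected to sum to 1
Workload = List[Tuple[Box, float]]         # pairs (R_i, s_i)


def contains(box: Box, q: Point) -> bool:
    """Return True if point q lies inside the closed box."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def estimated_selectivity(dist: Distribution, box: Box) -> float:
    """Total weight of the distribution's points that fall inside the box."""
    return sum(w for q, w in dist if contains(box, q))


def err_p(dist: Distribution, workload: Workload, p: float) -> float:
    """Average p-th power of the residuals; max residual when p is infinity (assumed convention)."""
    residuals = [abs(s - estimated_selectivity(dist, box)) for box, s in workload]
    if p == float("inf"):
        return max(residuals)
    return sum(r ** p for r in residuals) / len(residuals)


# Hypothetical toy workload: two rectangles in the plane with observed selectivities.
workload = [
    (((0.0, 0.0), (1.0, 1.0)), 0.5),
    (((0.5, 0.5), (2.0, 2.0)), 0.5),
]
# Hypothetical candidate distribution with three weighted points.
dist = [
    ((0.25, 0.25), 0.25),
    ((0.75, 0.75), 0.25),
    ((1.5, 1.5), 0.5),
]

e1 = err_p(dist, workload, 1)
e2 = err_p(dist, workload, 2)
einf = err_p(dist, workload, float("inf"))
```

In this toy instance the first rectangle is matched exactly (estimated selectivity 0.5), while the second is overestimated by 0.25, so only the second query contributes to the error.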

Paper Structure (20 sections, 17 theorems, 19 equations, 5 figures)

Introduction
Basic Algorithm
Decision procedure
MWU algorithm
Analysis
The Improved Algorithm
Extensions
Extending to other error functions
Extending to other types of ranges
Hardness
NP-Completeness
Conditional lower bounds
Related Work
Conclusion and future work
Missing proofs from Section \ref{sec:hardness}
...and 5 more sections

Key Result

Lemma 1

For any discrete distribution $\mathcal{D}\in \mathbb{D}$, there is another discrete distribution $\mathcal{D}'$ such that $\mathrm{supp}(\mathcal{D}')\subseteq \mathcal{P}$ and $\mathrm{err}(\mathcal{D},\mathcal{Z})=\mathrm{err}(\mathcal{D}',\mathcal{Z})$.
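One way to see why a statement like Lemma 1 is plausible: the error depends on a point only through the set of ranges that contain it, so all points sharing the same containment pattern can be merged into a single point carrying their total weight. The sketch below illustrates this for axis-aligned rectangles. It is not the paper's construction: the helper names are hypothetical, and the representative of each group is simply the first point seen, whereas the lemma uses a fixed canonical set $\mathcal{P}$.

```python
from collections import defaultdict


def contains(box, q):
    """Return True if point q lies inside the closed axis-aligned box (lo, hi)."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def signature(q, boxes):
    """Containment pattern of q: one boolean per range."""
    return tuple(contains(b, q) for b in boxes)


def collapse(dist, boxes):
    """Merge points with identical signatures, summing their weights.

    The representative is the first point seen per signature (an
    illustrative choice, not the canonical point set of the lemma).
    """
    weight = defaultdict(float)
    representative = {}
    for q, w in dist:
        sig = signature(q, boxes)
        weight[sig] += w
        representative.setdefault(sig, q)
    return [(representative[sig], w) for sig, w in weight.items()]


def selectivities(dist, boxes):
    """Estimated selectivity of every range under the distribution."""
    return [sum(w for q, w in dist if contains(b, q)) for b in boxes]


# Hypothetical toy input: two rectangles and three weighted points.
boxes = [((0.0, 0.0), (1.0, 1.0)), ((0.5, 0.5), (2.0, 2.0))]
dist = [((0.2, 0.2), 0.25), ((0.3, 0.3), 0.25), ((1.5, 1.5), 0.5)]

collapsed = collapse(dist, boxes)
before = selectivities(dist, boxes)
after = selectivities(collapsed, boxes)
```

Because every range sees the same total weight before and after merging, each residual $\lvert s_i - \sum_j w_j \cdot 1(q_j \in R_i)\rvert$ is unchanged, and so is the error for any choice of selectivities $s_i$.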

Figures (5)

Figure 1: The black points $q_i$ are points of the underlying data distribution $\mathcal{D}$, while the blue points $p_{\tau}$ belong to $\mathcal{P}$. The green dashed segments show three cells of the arrangement inside rectangle $R$. The weights of the points in $\mathcal{P}$ are: $w(p_1)=w_1+w_3$, $w(p_2)=w_2$, $w(p_3)=w_4+w_5$, $w(p_4)=w_6+w_8$, $w(p_5)=w_7$.
Figure 2: Schematic with two clauses and four variables
Figure 3: Variable Chain
Figure 4: Junction of variables $x_1, x_2$.
Figure 5: Clause $C_5=(\neg x_1 \lor x_2 \lor \neg x_3)$. The gray areas are the intersections $R_{16}^1\cap R_{17}^1$, $R_{21}^2\cap R_{22}^2$.

Theorems & Definitions (17)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Lemma 7
Lemma 8
Theorem 9
Theorem 10
...and 7 more
