A Theoretical Framework for Distribution-Aware Dataset Search
Aryan Esmailpour, Sainyam Galhotra, Rahul Raychaudhury, Stavros Sintos
TL;DR
This work introduces a theoretical framework for distribution-aware dataset search, unifying percentile and top-k preference queries under centralized and federated settings. It first proves lower bounds for exact data structures, then develops approximate, near-linear-space data structures with polylogarithmic, output-sensitive query time for Ptile and Pref, using coresets (ε-samples) and ε-nets, plus dynamic range-tree indexing with delay guarantees. The methods extend to conjunctions and disjunctions of predicates, support dynamic updates, and adapt to general synopses with known or unknown error δ. Unlike prior heuristic approaches, which do not guarantee that every qualifying dataset is found, the structures never miss a dataset that satisfies the predicate. The framework is relevant to data marketplaces and data discovery tasks that need distributional guarantees despite limited data access. Overall, the paper advances distribution-aware dataset search by delivering rigorous, scalable algorithms with provable accuracy and performance in both centralized and federated contexts.
Abstract
Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. Let $\mathcal{P}=\{P_1,...,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$. We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $θ$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in θ$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $θ$) report all indexes $J$ such that $j\in J$ iff $ω(P_j,v)\in θ$, where $ω(P_j,v)$ is the inner-product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2δ$, where $\varepsilon\in(0,1)$ and $δ$ is the error of synopses.
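To make the two predicates concrete, the following is a minimal brute-force sketch of what an exact Ptile or Pref query should return in the centralized setting. It is not the paper's data structure: it scans every dataset and is meant only as a reference for the query semantics. The function names (`ptile_query`, `pref_query`, `omega`), the closed-interval reading of $θ$, the reading of $ω(P_j,v)$ as the $k$-th largest inner product $\langle p, v\rangle$ over $p\in P_j$, and the choice to skip empty or too-small datasets are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Interval = Tuple[float, float]  # assumed closed interval [lo, hi]


def in_rect(p: Point, rect: Sequence[Interval]) -> bool:
    """True if p lies in the axis-parallel rectangle given per-dimension."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))


def ptile_query(repo: List[List[Point]], rect: Sequence[Interval],
                theta: Interval) -> List[int]:
    """Indexes j with |P_j ∩ R| / |P_j| in theta (exact, linear scan)."""
    lo, hi = theta
    out = []
    for j, P in enumerate(repo):
        if not P:  # assumption: empty datasets never match
            continue
        frac = sum(in_rect(p, rect) for p in P) / len(P)
        if lo <= frac <= hi:
            out.append(j)
    return out


def omega(P: List[Point], v: Point, k: int) -> float:
    """k-th largest inner product <p, v> over p in P (assumed reading of ω)."""
    scores = sorted((sum(a * b for a, b in zip(p, v)) for p in P),
                    reverse=True)
    return scores[k - 1]


def pref_query(repo: List[List[Point]], v: Point, k: int,
               theta: Interval) -> List[int]:
    """Indexes j with omega(P_j, v) in theta (exact, linear scan)."""
    lo, hi = theta
    out = []
    for j, P in enumerate(repo):
        if len(P) < k:  # assumption: datasets with fewer than k points never match
            continue
        if lo <= omega(P, v, k) <= hi:
            out.append(j)
    return out


if __name__ == "__main__":
    repo = [
        [(0, 0), (1, 1), (2, 2), (3, 3)],
        [(0, 0), (5, 5)],
        [(1, 1), (1, 2)],
    ]
    print(ptile_query(repo, [(0.5, 2.5), (0.5, 2.5)], (0.4, 1.0)))
    print(pref_query(repo, (1, 0), 2, (1, 2)))
```

This scan touches every point of every dataset on each query, which is exactly the cost the indexed structures in the paper are designed to avoid. An approximate structure with the stated guarantee would return a superset of this exact answer, where any extra index satisfies the predicate up to additive error $\varepsilon+2δ$.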
