Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries

Xiao Hu; Jinchao Huang

Abstract

Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art "propose-and-verify" framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size. In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.

Paper Structure (43 sections, 26 theorems, 8 equations, 1 figure, 12 algorithms)

Introduction
Problem Definitions
Uniform Sampling and Approximate Counting for Join-project Queries
Model of Computation
Handling Empty Results
Previous Results
Our Results
Uniform Sampling over Matrix Query
Framework
Step 1: Sample Full Join Result
Auxiliary indices
Step 2: Acceptance Check
Analysis
Standard Amplification Techniques
A variant of Step 1: Sample Full Join Result
...and 28 more sections

Key Result

Proposition 2.1

Let $Y$ be a discrete random variable taking values in $[M]$, where $M$ is a positive integer. Let $U$ be a uniform random variable on $[M]$, independent of $Y$. Then $\Pr[U \le Y] = \frac{1}{M} \cdot \mathbb{E}[Y]$.
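
The identity follows from conditioning on $Y$: for each value $y \in [M]$, $\Pr[U \le y] = y/M$, and averaging over $Y$ gives $\mathbb{E}[Y]/M$. As a minimal sanity check (not part of the paper), the sketch below verifies the identity exactly with rational arithmetic. It assumes $[M] = \{1, \dots, M\}$; the function names and the example distribution are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def prob_u_le_y(pmf, M):
    """Exact Pr[U <= Y], with U uniform on {1..M} and independent of Y.

    pmf maps each value y in {1..M} to Pr[Y = y] (illustrative input format).
    """
    total = Fraction(0)
    for y, p in pmf.items():
        # Count the outcomes u in {1..M} with u <= y by brute force.
        favorable = sum(1 for u in range(1, M + 1) if u <= y)
        total += p * Fraction(favorable, M)
    return total


def expectation(pmf):
    """Exact E[Y] for a finite distribution given as {value: probability}."""
    return sum(p * y for y, p in pmf.items())


# Hypothetical example distribution on [6]; probabilities sum to 1.
M = 6
pmf = {1: Fraction(1, 2), 4: Fraction(1, 3), 6: Fraction(1, 6)}

lhs = prob_u_le_y(pmf, M)   # Pr[U <= Y]
rhs = expectation(pmf) / M  # E[Y] / M
print(lhs, rhs)
```

Using `Fraction` rather than floating point keeps both sides exact, so the two quantities can be compared with `==` instead of a tolerance.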

Figures (1)

Figure 1: Comparison between prior results and our new results (in red) in the property testing model. $N$ is the input size, $\mathrm{OUT}$ is the output size, and $k$ is the number of relations. For uniform sampling, we assume $O(N)$ preprocessing time. The lower bounds for uniform sampling over chain queries apply only to $W$-uniform sampling algorithms. For approximate counting, we assume that $\epsilon$ is a small constant and $\delta = 1/N^{O(1)}$.

Theorems & Definitions (27)

Proposition 2.1
Definition 2.2: Negative Hypergeometric Distribution \cite{johnson2005univariate}
Theorem 2.3
Theorem 2.4
Proposition 3.1 \cite{chen2020random}
Lemma 3.2
Lemma 3.3
Lemma 3.4
Lemma 3.5
Lemma 3.6
...and 17 more
