Table of Contents
Fetching ...

PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees (Technical Report)

Yuxuan Zhu, Tengjun Jin, Stefanos Baziotis, Chengsong Zhang, Charith Mendis, Daniel Kang

TL;DR

PilotDB tackles the limited industry adoption of approximate query processing by delivering a practical middleware that provides a priori error guarantees without DBMS changes. It combines TAQA, a two-stage online AQP algorithm, with BSAP, a block-sampling theory that preserves statistical guarantees for nested and join queries. Empirical evaluation across PostgreSQL, SQL Server, and DuckDB shows up to 126× speedups with a 5% error target, while maintaining guaranteed error bounds. The approach eliminates maintenance overhead and DBMS modification, offering a scalable, database-agnostic path to fast, reliable approximate analytics in real-world workloads.

Abstract

After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid modifications to database management systems. To address these challenges, we introduce two novel techniques, TAQA and BSAP. TAQA is a two-stage online AQP algorithm that achieves all three properties for arbitrary queries. However, it can be slower than exact queries if we use standard row-level sampling. BSAP resolves this by enabling block-level sampling with statistical guarantees in TAQA. We simple ment TAQA and BSAP in a prototype middleware system, PilotDB, that is compatible with all DBMSs supporting efficient block-level sampling. We evaluate PilotDB on PostgreSQL, SQL Server, and DuckDB over real-world benchmarks, demonstrating up to 126X speedups when running with a 5% guaranteed error.

PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees (Technical Report)

TL;DR

PilotDB tackles the limited industry adoption of approximate query processing by delivering a practical middleware that provides a priori error guarantees without DBMS changes. It combines TAQA, a two-stage online AQP algorithm, with BSAP, a block-sampling theory that preserves statistical guarantees for nested and join queries. Empirical evaluation across PostgreSQL, SQL Server, and DuckDB shows up to 126× speedups with a 5% error target, while maintaining guaranteed error bounds. The approach eliminates maintenance overhead and DBMS modification, offering a scalable, database-agnostic path to fast, reliable approximate analytics in real-world workloads.

Abstract

After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid modifications to database management systems. To address these challenges, we introduce two novel techniques, TAQA and BSAP. TAQA is a two-stage online AQP algorithm that achieves all three properties for arbitrary queries. However, it can be slower than exact queries if we use standard row-level sampling. BSAP resolves this by enabling block-level sampling with statistical guarantees in TAQA. We simple ment TAQA and BSAP in a prototype middleware system, PilotDB, that is compatible with all DBMSs supporting efficient block-level sampling. We evaluate PilotDB on PostgreSQL, SQL Server, and DuckDB over real-world benchmarks, demonstrating up to 126X speedups when running with a 5% guaranteed error.

Paper Structure

This paper contains 40 sections, 16 theorems, 30 equations, 19 figures, 5 tables.

Key Result

theorem 1

Assuming that the aggregate to estimate is sub-Gaussian, if the input query $Q_{in}$ is rewritten into a final query $Q_{final}$ based on the sampling plan $\Theta$ obtained from the Procedure theorem:rel-error, the estimated aggregate $\hat{\mu}$ computed in $Q_{final}$ satisfies the inequality: $\

Figures (19)

  • Figure 1: Sampling 3% data from a table with a block size of 100 rows: in expectation, the row-level method requires scanning approximately 95% blocks, while the block-level method scans approximately 3% blocks.
  • Figure 2: Workflow of PilotDB.
  • Figure 3: Demonstration of query rewriting with PostgreSQL syntax. Rewritten parts are emphasized.
  • Figure 4: Comparison of the system efficiency of sampling methods that do not modify DBMSs. At small sampling rates, such as 0.01%, block sampling can be 500$\times$ faster than others.
  • Figure 5: Demonstration of the commutativity between block sampling $\mathcal{B}$ and relational operations $f_R$ that remove data (e.g., WHERE, JOIN conditions, and GROUP BY); Order of operations does not affect the inclusion probability of each data block.
  • ...and 14 more figures

Theorems & Definitions (18)

  • theorem 1
  • lemma 1
  • lemma 2
  • definition 1
  • proposition 1
  • proposition 2
  • proposition 3
  • proposition 4
  • theorem 2
  • lemma 3
  • ...and 8 more