Avoiding Materialisation for Guarded Aggregate Queries

Matthias Lanzinger, Reinhard Pichler, Alexander Selzer

TL;DR

This paper addresses the challenge of excessive intermediate results in analytic queries with many joins by introducing guarded aggregate queries, which permit evaluating aggregates without materialising join results. The core idea is to propagate frequencies and aggregate information up a join tree using a frequency attribute $c_u$ and, in the guarded setting, a root guard that contains all grouping and aggregate inputs; this enables exact evaluation for a wide class of aggregates. The authors extend this with piece-wise-guarded queries and a new AggJoin physical operator to fuse joining and aggregation, achieving linear-space data propagation and seamless integration into Spark SQL. Empirical results across multiple benchmarks show substantial speedups (up to orders of magnitude) on challenging queries, while simple queries incur little or no overhead. The work demonstrates significant practical impact for large-scale analytical workloads by avoiding costly materialisation and offering a pathway to extend these ideas to broader query classes.
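The frequency-propagation idea can be illustrated with a minimal sketch. It assumes a three-relation path join R(a, b), S(b, c), T(c, d) with join tree T → S → R, and the query `SELECT a, COUNT(*) ... GROUP BY a`, which is guarded by the root R because the only grouping attribute lives there. The relation names, the helper `propagate`, and the sample data are hypothetical and chosen only for illustration; this is not the paper's AggJoin operator or its Spark SQL integration, just the underlying counting principle.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy relations (not from the paper).
R = [(1, 10), (1, 20), (2, 10), (3, 30)]      # R(a, b)
S = [(10, 100), (10, 200), (20, 100)]         # S(b, c)
T = [(100, "x"), (100, "y"), (200, "z")]      # T(c, d)


def propagate(parent, child, parent_key, child_key):
    """Push child frequencies into the parent without building the join.

    parent and child are lists of (row, frequency) pairs. The result has at
    most one entry per parent row, so intermediate size stays linear.
    """
    totals = defaultdict(int)
    for row, freq in child:
        totals[row[child_key]] += freq
    # Parent rows without a join partner are dropped (semi-join effect).
    return [
        (row, freq * totals[row[parent_key]])
        for row, freq in parent
        if row[parent_key] in totals
    ]


def guarded_count(r, s, t):
    """COUNT(*) grouped by R.a, computed bottom-up along the join tree."""
    t_freq = [(row, 1) for row in t]          # every tuple starts with frequency 1
    s_freq = propagate([(row, 1) for row in s], t_freq, parent_key=1, child_key=0)
    r_freq = propagate([(row, 1) for row in r], s_freq, parent_key=1, child_key=0)
    counts = defaultdict(int)
    for row, freq in r_freq:
        counts[row[0]] += freq
    return dict(counts)


def naive_count(r, s, t):
    """Reference answer: enumerate the full join, then aggregate."""
    counts = defaultdict(int)
    for rr, ss, tt in product(r, s, t):
        if rr[1] == ss[0] and ss[1] == tt[0]:
            counts[rr[0]] += 1
    return dict(counts)


result = guarded_count(R, S, T)
```

On this toy input both functions agree, but `guarded_count` never holds more than one frequency per input tuple, whereas `naive_count` enumerates every combination of tuples. Other aggregates would need additional propagated information beyond the single frequency used here.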

Abstract

Optimising queries with many joins is known to be a hard problem. The explosion of intermediate results, as opposed to a much smaller final result, poses a serious challenge to modern database management systems (DBMSs). This is particularly glaring in the case of analytical queries that join many tables but ultimately output only comparatively small aggregate information. Analogous problems are faced by graph database systems when processing analytical queries with aggregates on top of complex path queries. In this work, we propose novel optimisation techniques, on both the logical and the physical level, that allow us to avoid the materialisation of join results for certain types of aggregate queries. The key to these optimisations is the notion of guardedness, by which we impose restrictions on the occurrence of attributes in GROUP BY clauses and in aggregate expressions. The efficacy of our optimisations is validated through their implementation in Spark SQL and extensive empirical evaluation on various standard benchmarks.

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, 6 tables, 2 algorithms.

Figures (7)

  • Figure 1: Query over the TPC-H Schema
  • Figure 2: Join tree for the query in Fig. 1
  • Figure 3: Query plans for Example \ref{exp:median}
  • Figure 4: Evaluation of the query from Figure 1
  • Figure 5: Comparison of the maximal number of materialised tuples in a table during query execution for 20 queries of STATS-CEB. Y-axis in logarithmic scale (base 10).
  • ...and 2 more figures

Theorems & Definitions (7)

  • Example 1.1
  • Definition 1
  • Example 4.1
  • Example 4.2
  • Definition 2
  • Example A.1
  • Example A.2