Table of Contents
Fetching ...

GRACEFUL: A Learned Cost Estimator For UDFs

Johannes Wehrstein, Tiemo Bang, Roman Heinrich, Carsten Binnig

TL;DR

This paper tackles the challenge of predicting runtimes for queries that contain User-Defined Functions (UDFs), a capability often missing in traditional DBMS cost models. GRACEFUL introduces a graph-based cost estimator that encodes UDFs via a control-flow graph (CFG), augments it with data-flow and selectivity annotations, and embeds it jointly with the surrounding query plan in a graph neural network to predict end-to-end runtimes in a zero-shot setting. A pull-up/push-down advisor leverages the cost model to decide whether to move UDF predicates up or down the plan, using regret optimization to handle uncertain UDF selectivities and achieving substantial speedups with minimal overhead. The authors also release a large synthetic UDF benchmark (90k queries across 20 databases) to promote research in UDF cost modeling, and demonstrate strong accuracy and robustness across unseen UDFs, workloads, and datasets. Overall, GRACEFUL advances UDF-aware optimization by providing transferable, structure-aware cost predictions and a practical optimization aid that yields significant performance gains in real-world-like scenarios.

Abstract

User-Defined-Functions (UDFs) are a pivotal feature in modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges, primarily due to the difficulty of estimating UDF execution costs. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assumptions, resulting in suboptimal performance for queries involving UDFs. In this paper, we introduce GRACEFUL, a novel learned cost model to make accurate cost predictions of query plans with UDFs enabling optimization decisions for UDFs in DBMS. For example, as we show in our evaluation, using our cost model, we can achieve 50x speedups through informed pull-up/push-down filter decisions of the UDF compared to the standard case where always a filter push-down is applied. Additionally, we release a synthetic dataset of over 90,000 UDF queries to promote further research in this area.

GRACEFUL: A Learned Cost Estimator For UDFs

TL;DR

This paper tackles the challenge of predicting runtimes for queries that contain User-Defined Functions (UDFs), a capability often missing in traditional DBMS cost models. GRACEFUL introduces a graph-based cost estimator that encodes UDFs via a control-flow graph (CFG), augments it with data-flow and selectivity annotations, and embeds it jointly with the surrounding query plan in a graph neural network to predict end-to-end runtimes in a zero-shot setting. A pull-up/push-down advisor leverages the cost model to decide whether to move UDF predicates up or down the plan, using regret optimization to handle uncertain UDF selectivities and achieving substantial speedups with minimal overhead. The authors also release a large synthetic UDF benchmark (90k queries across 20 databases) to promote research in UDF cost modeling, and demonstrate strong accuracy and robustness across unseen UDFs, workloads, and datasets. Overall, GRACEFUL advances UDF-aware optimization by providing transferable, structure-aware cost predictions and a practical optimization aid that yields significant performance gains in real-world-like scenarios.

Abstract

User-Defined-Functions (UDFs) are a pivotal feature in modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges, primarily due to the difficulty of estimating UDF execution costs. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assumptions, resulting in suboptimal performance for queries involving UDFs. In this paper, we introduce GRACEFUL, a novel learned cost model to make accurate cost predictions of query plans with UDFs enabling optimization decisions for UDFs in DBMS. For example, as we show in our evaluation, using our cost model, we can achieve 50x speedups through informed pull-up/push-down filter decisions of the UDF compared to the standard case where always a filter push-down is applied. Additionally, we release a synthetic dataset of over 90,000 UDF queries to promote further research in this area.

Paper Structure

This paper contains 21 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The impact of pull-up optimization an a SQL query with a UDF: carefully chosen pull-ups of udf filters in a query plan can drastically reduce query runtime. The numbers along the query edges annotate the intermediate cardinalities.
  • Figure 2: Cost Estimation Overview for a SQL Query ⓪ with a UDF ①: For the UDF code, the CFG is computed ② and used as a transformed DAG representation of the code ③. On top of the CFG, we annotate estimates of the hit ratios of branches using a cardinality estimator ④. The UDF graph is combined with the query plan graph ⑤, and a GNN model predicts the runtime for the joint graph ⑥.
  • Figure 3: ① With the push-down of the UDF, the uncertainty in estimating the cost of join and other post-UDF operators increases drastically since the selectivity of the UDF-filter cannot be estimated and subsequent cardinality estimates are therefore inaccurate. ② Conversely, with a pull-up of the UDF, the UDF now has inaccurate cardinality estimates since, with the increasing number of joins, the accuracy of cardinality estimates drops leading to inaccurate cost estimates for the UDF.
  • Figure 4: UDF Pull-up/Push-down Advisor: First, query plans are generated for both the pull-up and push-down plan. Next, GRACEFUL enumerates multiple selectivity values for the (unknown and unpredictable) UDF-filter selectivity and creates a separate graph for each selectivity. These annotated plans are then passed through the GNN models, producing two cost distributions (one for the pull-up plan and one for the push-down plan). Based on a heuristics-based selection strategy we choose one plan alternative.
  • Figure 5: Cost Estimates for the different datasets. The errors are collected in a leave-on-out cross-validation fashion: the system is trained on 19 datasets and tested on the unseen 20th dataset. The y-axis shows the Q-Error (in logarithmic scale). The model shows a robust performance across all the twenty datasets. It is visible that some datasets are more challenging for some cardinality estimators, leading to higher errors in our cost estimates.
  • ...and 3 more figures