GRACEFUL: A Learned Cost Estimator For UDFs
Johannes Wehrstein, Tiemo Bang, Roman Heinrich, Carsten Binnig
TL;DR
This paper tackles the problem of predicting runtimes for queries that contain User-Defined Functions (UDFs), a capability that traditional DBMS cost models largely lack. GRACEFUL is a graph-based cost estimator: it encodes each UDF as a control-flow graph (CFG), augments that graph with data-flow and selectivity annotations, and embeds it together with the surrounding query plan in a graph neural network that predicts end-to-end runtime in a zero-shot setting. A pull-up/push-down advisor uses the cost model to decide whether to move UDF predicates up or down the plan, applying regret optimization to cope with uncertain UDF selectivities and achieving substantial speedups at low overhead. The authors also release a large synthetic UDF benchmark (90k queries across 20 databases) to promote research on UDF cost modeling, and they report strong accuracy and robustness on unseen UDFs, workloads, and datasets. Overall, GRACEFUL contributes transferable, structure-aware cost predictions for UDF queries and a practical optimization advisor built on top of them.
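To make the regret-based decision concrete, here is a minimal sketch of one common formulation, minimax regret, applied to the pull-up/push-down choice. It is an illustration under stated assumptions rather than the authors' implementation: the table `predicted`, the helper names `max_regret` and `choose_placement`, and all runtimes are hypothetical, and the paper's exact objective may differ.

```python
from typing import Dict

# Hypothetical predicted runtimes in seconds, indexed as
# predicted[placement][assumed_udf_selectivity]. The numbers are invented;
# in practice they would come from a cost model queried once per
# placement and candidate selectivity.
predicted: Dict[str, Dict[float, float]] = {
    "push_down": {0.1: 12.0, 0.5: 14.0, 0.9: 16.0},
    "pull_up": {0.1: 3.0, 0.5: 9.0, 0.9: 17.0},
}


def max_regret(costs: Dict[str, Dict[float, float]], placement: str) -> float:
    """Worst-case gap between `placement` and the best option per selectivity."""
    return max(
        costs[placement][sel] - min(costs[p][sel] for p in costs)
        for sel in costs[placement]
    )


def choose_placement(costs: Dict[str, Dict[float, float]]) -> str:
    """Pick the placement whose worst-case regret is smallest."""
    return min(costs, key=lambda p: max_regret(costs, p))


if __name__ == "__main__":
    for placement in predicted:
        print(placement, "max regret:", max_regret(predicted, placement))
    print("chosen:", choose_placement(predicted))
```

With these invented numbers, pushing the filter down can be up to 9 seconds worse than the best choice, while pulling it up is never more than 1 second worse, so the sketch selects the pull-up plan even though the true selectivity is unknown.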
Abstract
User-Defined Functions (UDFs) are a pivotal feature of modern DBMS, enabling the extension of native DBMS functionality with custom logic. However, integrating UDFs into query optimization poses significant challenges, primarily because UDF execution costs are difficult to estimate. Consequently, existing cost models in DBMS optimizers largely ignore UDFs or rely on static assumptions, resulting in suboptimal performance for queries involving UDFs. In this paper, we introduce GRACEFUL, a novel learned cost model that makes accurate cost predictions for query plans with UDFs, enabling optimization decisions for UDFs in DBMS. For example, as we show in our evaluation, our cost model lets us achieve 50x speedups through informed pull-up/push-down decisions for UDF filters, compared to the standard approach of always pushing the filter down. Additionally, we release a synthetic dataset of over 90,000 UDF queries to promote further research in this area.
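The intuition for why always pushing a UDF filter down can be costly can be shown with a toy cost calculation. The sketch below is a deliberately simplified assumption (linear per-row costs, a single join, hypothetical function and parameter names) and is not the learned model described in the paper.

```python
def push_down_cost(rows: int, udf_cost: float, join_cost: float,
                   udf_sel: float) -> float:
    """UDF filter runs on every input row; the join sees only the survivors."""
    return rows * udf_cost + rows * udf_sel * join_cost


def pull_up_cost(rows: int, udf_cost: float, join_cost: float,
                 join_sel: float) -> float:
    """Join processes every input row; the UDF runs only on the join output."""
    return rows * join_cost + rows * join_sel * udf_cost


if __name__ == "__main__":
    # Invented scenario: an expensive UDF that filters little, and a
    # selective join. Costs are in arbitrary per-row units.
    rows = 1_000_000
    down = push_down_cost(rows, udf_cost=50.0, join_cost=1.0, udf_sel=0.9)
    up = pull_up_cost(rows, udf_cost=50.0, join_cost=1.0, join_sel=0.01)
    print("push-down:", down, "pull-up:", up, "ratio:", down / up)
```

In this invented scenario the pull-up plan is roughly 34 times cheaper, because the expensive UDF is evaluated on 1% of the rows instead of all of them. With a cheap, highly selective UDF the comparison flips, which is why the decision needs a cost estimate instead of a fixed rule.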
