Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective

Claude Lehmann; Pavel Sulimov; Kurt Stockinger

TL;DR

This paper presents an end-to-end benchmarking framework for learned query optimizers (LQOs) to address reproducibility and data-generation issues that have hampered fair evaluation. It systematically analyzes training data generation, query/plan encodings, training practices, and evaluation methodology, highlighting how covariate shift and dataset splits influence performance. Through extensive experiments on JOB and STACK workloads, it demonstrates that PostgreSQL often outperforms state-of-the-art LQOs when evaluated under standardized, end-to-end conditions, underscoring the need for robust benchmarking and careful pipeline design. The work provides practical recommendations and a framework to enable fair, reproducible, end-to-end comparisons, inviting the community to re-think where LQOs offer genuine advantages and how to measure them reliably.

Abstract

The current boom of learned query optimizers (LQOs) can be explained not only by the continuous improvement of deep learning (DL) methods but also by the straightforward formulation of the query optimization problem (QOP) as a machine learning (ML) problem. The idea is often to replace the dynamic programming approaches widely used for solving the QOP with more powerful methods such as reinforcement learning. However, such a rapid "game change" in the field of QOP could not pass without consequences: the parts of the ML pipeline other than predictive model development still have large potential for improvement. For instance, different LQOs introduce their own restrictions on training data generation from queries, use an arbitrary train/validation approach, and evaluate on a self-chosen split of benchmark queries. In this paper, we attempt to standardize the ML pipeline for evaluating LQOs by introducing a new end-to-end benchmarking framework. Additionally, we guide the reader through each data science stage in the ML pipeline and provide novel insights from the machine learning perspective, considering the specifics of the QOP. Finally, we perform a rigorous evaluation of existing LQOs, showing that PostgreSQL outperforms these LQOs in almost all experiments, with the results depending on the train/test split.

Paper Structure (36 sections, 7 figures, 2 tables)

This paper contains 36 sections, 7 figures, 2 tables.

Introduction
Related Work
Training Data Generation
Dataset Choice
Reduced Complexity of Query Plans
Invariant Training Data Generation
Feature Variables: Dynamic Optimization
Dependent Variables: Cold vs. Hot Cache
Query & Plan Encoding
Encoding Robustness
Encoding Expressiveness
Training Learned Query Optimizers
Avoiding ML Model Overfitting
Changing Target Variables On-the-Fly
Evaluating Learned Query Optimizers
...and 21 more sections

Figures (7)

Figure 1: Comparison of classical and learned query optimizers (LQOs), shown in the top and bottom halves, respectively. The stages (1) Training Data Generation, (3) LQO Training, and (4) LQO Evaluation are the primary components of our End-to-End Benchmarking Framework. Together with the (2) Query & Plan Encoding stage, they form the typical machine learning pipeline for an LQO.
Figure 2: Scatter plot of the execution time per number of joins for all queries in JOB.
Figure 3: Overview of different dataset split sampling types for JOB: Leave One Out Sampling (top), Random Sampling (middle), and Base Query Sampling (bottom). For instance, Base Query 1 has 4 variations: 1a, 1b, 1c and 1d.
Figure 4: Comparative overview of each method's performance on the test set of various dataset splits on the Join Order Benchmark (JOB). The figure on the left depicts the planning time (darker colour) and inference time (lighter colour), respectively. Note that Bao runs inside PostgreSQL as an extension, and its inference time is directly added to the planning time. The figure on the right side shows the execution times on the same train/test splits. Please observe that the x-axis of both figures is divided.
Figure 5: Comparative overview of each method's performance on the test set of various dataset splits on STACK.
...and 2 more figures
