Stage: Query Execution Time Prediction in Amazon Redshift

Ziniu Wu; Ryan Marcus; Zhengchun Liu; Parimarjan Negi; Vikram Nathan; Pascal Pfeil; Gaurav Saxena; Mohammad Rahman; Balakrishnan Narayanaswamy; Tim Kraska

Stage: Query Execution Time Prediction in Amazon Redshift

Ziniu Wu, Ryan Marcus, Zhengchun Liu, Parimarjan Negi, Vikram Nathan, Pascal Pfeil, Gaurav Saxena, Mohammad Rahman, Balakrishnan Narayanaswamy, Tim Kraska

TL;DR

The paper tackles the challenge of predicting query execution time in Amazon Redshift to support critical downstream tasks like admission, scheduling, and optimization. It introduces Stage, a hierarchical predictor composed of an exec-time cache, a local Bayesian ensemble of XGBoost models, and a global graph-based model trained across many instances. By cascading these components, Stage achieves fast, uncertainty-aware predictions that handle cold-start and workload drift, delivering about a $20\%$ reduction in end-to-end latency on production-like workloads. The work demonstrates practical deployment of multi-stage ML in a production DBMS, provides extensive ablations and uncertainty analyses, and discusses future directions for applying hierarchical predictors to other tasks and environment-aware prediction challenges.

Abstract

Query performance (e.g., execution time) prediction is a critical component of modern DBMSes. As a pioneering cloud data warehouse, Amazon Redshift relies on an accurate execution time prediction for many downstream tasks, ranging from high-level optimizations, such as automatically creating materialized views, to low-level tasks on the critical path of query execution, such as admission, scheduling, and execution resource control. Unfortunately, many existing execution time prediction techniques, including those used in Redshift, suffer from cold start issues, inaccurate estimation, and are not robust against workload/data changes. In this paper, we propose a novel hierarchical execution time predictor: the Stage predictor. The Stage predictor is designed to leverage the unique characteristics and challenges faced by Redshift. The Stage predictor consists of three model states: an execution time cache, a lightweight local model optimized for a specific DB instance with uncertainty measurement, and a complex global model that is transferable across all instances in Redshift. We design a systematic approach to use these models that best leverages optimality (cache), instance-optimization (local model), and transferable knowledge about Redshift (global model). Experimentally, we show that the Stage predictor makes more accurate and robust predictions while maintaining a practical inference latency and memory overhead. Overall, the Stage predictor can improve the average query execution latency by $20\%$ on these instances compared to the prior query performance predictor in Redshift.

Stage: Query Execution Time Prediction in Amazon Redshift

TL;DR

reduction in end-to-end latency on production-like workloads. The work demonstrates practical deployment of multi-stage ML in a production DBMS, provides extensive ablations and uncertainty analyses, and discusses future directions for applying hierarchical predictors to other tasks and environment-aware prediction challenges.

Abstract

on these instances compared to the prior query performance predictor in Redshift.

Paper Structure (55 sections, 2 equations, 11 figures, 6 tables)

This paper contains 55 sections, 2 equations, 11 figures, 6 tables.

Introduction
Background
Execution time predictor in Redshift
Importance of exec-time predictor
The prior AutoWLM exec-time predictor in Redshift
Related Works
Uncertainty of XGBoost models
Instance-optimized exec-time predictor
Zero-shot exec-time predictor
Machine Learning for Databases
Design Principles
Repeated queries
High-confidence predictions
Low inference latency
Stage Predictor
...and 40 more sections

Figures (11)

Figure 1: Distribution statistics from the Redshift fleet
Figure 2: The key components and ideas of Stage predictor.
Figure 3: Queries' lifetime inside Redshift.
Figure 4: Workflow overview of the Stage predictor: Queries are first planned by the query optimizer and then featurized into a vector. If the exec-time cache has a match for the feature vector, the cached exec-time will be returned. Otherwise, the feature vector will go to the local model. If the local model predicts that a query is short-running or if it is highly confident, its prediction will be returned. Otherwise, the global model will use the original physical plan and stats from the user's specific database to make a prediction. After execution, the pair of vector representation and exec-time is added to the cache and local training data (not shown).
Figure 5: Global model featurization and architecture.
...and 6 more figures

Stage: Query Execution Time Prediction in Amazon Redshift

TL;DR

Abstract

Stage: Query Execution Time Prediction in Amazon Redshift

Authors

TL;DR

Abstract

Table of Contents

Figures (11)