Stage: Query Execution Time Prediction in Amazon Redshift
Ziniu Wu, Ryan Marcus, Zhengchun Liu, Parimarjan Negi, Vikram Nathan, Pascal Pfeil, Gaurav Saxena, Mohammad Rahman, Balakrishnan Narayanaswamy, Tim Kraska
TL;DR
The paper tackles query execution time prediction in Amazon Redshift, which supports critical downstream tasks such as admission control, scheduling, and optimization. It introduces Stage, a hierarchical predictor composed of an execution time cache, a local Bayesian ensemble of XGBoost models, and a global graph-based model trained across many instances. By cascading these components, Stage produces fast, uncertainty-aware predictions that handle cold start and workload drift, improving average query execution latency by about $20\%$ compared to Redshift's prior predictor. The work demonstrates practical deployment of multi-stage ML in a production DBMS, provides extensive ablations and uncertainty analyses, and discusses future directions, including applying hierarchical predictors to other tasks and environment-aware prediction.
Abstract
Query performance (e.g., execution time) prediction is a critical component of modern DBMSes. As a pioneering cloud data warehouse, Amazon Redshift relies on accurate execution time prediction for many downstream tasks, ranging from high-level optimizations, such as automatically creating materialized views, to low-level tasks on the critical path of query execution, such as admission, scheduling, and execution resource control. Unfortunately, many existing execution time prediction techniques, including those used in Redshift, suffer from cold start issues and inaccurate estimation, and are not robust against workload and data changes. In this paper, we propose a novel hierarchical execution time predictor: the Stage predictor. The Stage predictor is designed to leverage the unique characteristics of Redshift and to address the challenges it faces. It consists of three model states: an execution time cache, a lightweight local model optimized for a specific DB instance with uncertainty measurement, and a complex global model that is transferable across all instances in Redshift. We design a systematic approach to using these models that best leverages optimality (cache), instance-optimization (local model), and transferable knowledge about Redshift (global model). Experimentally, we show that the Stage predictor makes more accurate and robust predictions while maintaining a practical inference latency and memory overhead. Overall, the Stage predictor can improve the average query execution latency by $20\%$ on the evaluated instances compared to the prior query performance predictor in Redshift.
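The three-state hierarchy described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `StageSketch`, the `uncertainty_threshold` parameter, the use of ensemble standard deviation as the uncertainty measure, and the routing rule itself (cache hit first, then the local model if its ensemble members agree, otherwise the global model). The abstract does not specify these details, and the models here are plain callables standing in for the XGBoost ensemble and the global model.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Features = Sequence[float]
Model = Callable[[Features], float]  # stand-in for any trained regressor


@dataclass
class StageSketch:
    """Hypothetical cascade: cache -> local ensemble -> global model."""

    local_ensemble: List[Model]          # instance-specific models (assumed)
    global_model: Model                  # cross-instance model (assumed)
    uncertainty_threshold: float = 1.0   # assumed cutoff on ensemble spread
    cache: Dict[Hashable, float] = field(default_factory=dict)

    def predict(self, query_key: Hashable, features: Features) -> Tuple[float, str]:
        # State 1: a previously observed query is answered from the cache.
        if query_key in self.cache:
            return self.cache[query_key], "cache"

        # State 2: use the local ensemble when its members agree closely.
        if self.local_ensemble:
            preds = [model(features) for model in self.local_ensemble]
            if pstdev(preds) <= self.uncertainty_threshold:
                return mean(preds), "local"

        # State 3: fall back to the global model (cold start or high uncertainty).
        return self.global_model(features), "global"

    def observe(self, query_key: Hashable, exec_time: float) -> None:
        # Record the measured execution time so repeats hit the cache.
        self.cache[query_key] = exec_time


if __name__ == "__main__":
    stage = StageSketch(
        local_ensemble=[lambda f: 2.0, lambda f: 2.2],
        global_model=lambda f: 10.0,
        uncertainty_threshold=0.5,
    )
    print(stage.predict("q1", [1.0]))  # routed to the local ensemble
    stage.observe("q1", 3.5)
    print(stage.predict("q1", [1.0]))  # now served from the cache
```

In this sketch a new instance with no trained local models goes straight to the global model, which is one simple way to picture how a transferable model can cover the cold start case.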
