Zero-Execution Retrieval-Augmented Configuration Tuning of Spark Applications
Raunaq Suri, Ilan Gofman, Guangwei Yu, Jesse C. Cresswell
TL;DR
This work tunes Spark configurations for ad hoc workloads without any runtime executions by reframing the problem as retrieval over historical data. The proposed method, ZEST, embeds Spark logical plans, retrieves the most similar historical workloads, and averages the retrieved results to predict a configuration that initializes a new job before it runs. Empirically, ZEST achieves 93.3% of the runtime improvement of one-execution optimization methods while avoiding the overhead of an initial run with default settings. It remains competitive with online Bayesian optimization (BO) baselines and greatly reduces accumulated costs for infrequent queries on TPC-H and TPC-DS. The authors release a large curated dataset of 19,360 query executions to support future research on zero-execution tuning, and they demonstrate robustness across data sizes and unseen catalogs, which highlights practical benefits for analytics workloads and for enterprise workloads with recurring schemas.
Abstract
Large-scale data processing is increasingly done using distributed computing frameworks like Apache Spark, which have a considerable number of configurable parameters that affect runtime performance. For optimal performance, these parameters must be tuned to the specific job being run. Tuning commonly requires multiple executions to collect runtime information for updating parameters. This is infeasible for ad hoc queries that are run once or infrequently. Zero-execution tuning, where parameters are automatically set before a job's first run, can provide significant savings for all types of applications, but is more challenging since runtime information is not available. In this work, we propose a novel method for zero-execution tuning of Spark configurations based on retrieval. Our method achieves 93.3% of the runtime improvement of state-of-the-art one-execution optimization, entirely avoiding the slow initial execution using default settings. The shift to zero-execution tuning results in a lower cumulative runtime over the first 140 runs, and provides the largest benefit for ad hoc and analytical queries which only need to be executed once. We release the largest and most comprehensive suite of Spark query datasets, optimal configurations, and runtime information, which will promote future development of zero-execution tuning methods.
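The retrieval idea summarized above (embed a query's logical plan, find the most similar historical workloads, and average what was retrieved) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy embeddings, the use of cosine similarity, the value of k, the rounding of averaged values to integers, and the helper names `cosine` and `retrieve_and_average` are all assumptions made for illustration. The two configuration keys are standard Spark settings, but their values here are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_and_average(query_emb, history, k=2):
    """Average the configurations of the k historical workloads most similar to the query.

    `history` is a list of (embedding, config) pairs, where each config is a
    dict of numeric Spark parameters sharing the same keys. Rounding to
    integers is an assumption that suits the integer-valued parameters below.
    """
    ranked = sorted(history, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    top = ranked[:k]
    keys = top[0][1].keys()
    return {key: round(sum(cfg[key] for _, cfg in top) / len(top)) for key in keys}


# Hypothetical history: plan embeddings paired with previously tuned configurations.
history = [
    ([1.0, 0.0, 0.0], {"spark.sql.shuffle.partitions": 100, "spark.executor.cores": 2}),
    ([0.9, 0.1, 0.0], {"spark.sql.shuffle.partitions": 200, "spark.executor.cores": 4}),
    ([0.0, 1.0, 0.0], {"spark.sql.shuffle.partitions": 400, "spark.executor.cores": 8}),
    ([0.0, 0.0, 1.0], {"spark.sql.shuffle.partitions": 50, "spark.executor.cores": 1}),
]

# Embedding of a new, never-executed query (placeholder values).
query_emb = [1.0, 0.0, 0.0]

predicted = retrieve_and_average(query_emb, history, k=2)
print(predicted)
```

In this toy setup the first two historical workloads are the nearest neighbors, so the predicted configuration is the average of their settings. No execution of the new query is needed to obtain it, which is the property that distinguishes zero-execution tuning from methods that first run the job with default settings.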
