Zero-Execution Retrieval-Augmented Configuration Tuning of Spark Applications

Raunaq Suri, Ilan Gofman, Guangwei Yu, Jesse C. Cresswell

TL;DR

This work tackles tuning Spark configurations for ad hoc workloads without any runtime executions by reframing the problem as retrieval over historical data. The proposed method, ZEST, embeds Spark logical plans, retrieves the top-$k$ most similar historical workloads together with their tuned configurations, and averages the retrieved configurations to initialize a new job before it runs. Empirical results show that ZEST achieves about $93.3\%$ of the runtime improvement of one-execution (online) optimization methods while avoiding the overhead of the initial run with default settings. It remains competitive with online Bayesian optimization (BO) baselines and greatly reduces accumulated cost for infrequently run queries on TPC-H and TPC-DS. The authors release a large curated dataset of 19,360 query executions to support future research on zero-execution tuning, and they demonstrate robustness across data sizes and unseen catalogs, highlighting practical benefits for analytics workloads and for enterprise workloads with recurring schemas.

Abstract

Large-scale data processing is increasingly done using distributed computing frameworks like Apache Spark, which have a considerable number of configurable parameters that affect runtime performance. For optimal performance, these parameters must be tuned to the specific job being run. Tuning commonly requires multiple executions to collect runtime information for updating parameters. This is infeasible for ad hoc queries that are run once or infrequently. Zero-execution tuning, where parameters are automatically set before a job's first run, can provide significant savings for all types of applications, but is more challenging since runtime information is not available. In this work, we propose a novel method for zero-execution tuning of Spark configurations based on retrieval. Our method achieves 93.3% of the runtime improvement of state-of-the-art one-execution optimization, entirely avoiding the slow initial execution using default settings. The shift to zero-execution tuning results in a lower cumulative runtime over the first 140 runs, and provides the largest benefit for ad hoc and analytical queries which only need to be executed once. We release the largest and most comprehensive suite of Spark query datasets, optimal configurations, and runtime information, which will promote future development of zero-execution tuning methods.
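
The cumulative-runtime argument in the abstract can be illustrated with a small arithmetic sketch. Everything numeric below is hypothetical: the default and tuned runtimes are made-up values, and the cost model (one default run followed by identical tuned runs) is a simplifying assumption, not the paper's evaluation protocol. Only the 93.3% ratio is taken from the abstract.

```python
# Hypothetical runtimes in seconds; these are NOT measurements from the paper.
DEFAULT_RUNTIME = 100.0   # assumed runtime with default Spark settings
ONE_EXEC_RUNTIME = 60.0   # assumed runtime after one-execution tuning
IMPROVEMENT_RATIO = 0.933 # from the abstract: ZEST reaches 93.3% of that improvement

# Runtime implied for the zero-execution method under these assumptions.
ZERO_EXEC_RUNTIME = DEFAULT_RUNTIME - IMPROVEMENT_RATIO * (DEFAULT_RUNTIME - ONE_EXEC_RUNTIME)


def accumulated_cost(per_run: float, n_runs: int, initial_run: float = 0.0) -> float:
    """Total runtime of an optional untuned initial run plus n_runs tuned runs."""
    return initial_run + per_run * n_runs


def zero_exec_cost(n_runs: int) -> float:
    # Zero-execution tuning has no default run to pay for.
    return accumulated_cost(ZERO_EXEC_RUNTIME, n_runs)


def one_exec_cost(n_runs: int) -> float:
    # One-execution tuning pays for one default run before its tuned runs.
    return accumulated_cost(ONE_EXEC_RUNTIME, n_runs, initial_run=DEFAULT_RUNTIME)


def break_even_runs() -> float:
    """Number of tuned runs at which both accumulated costs are equal."""
    return DEFAULT_RUNTIME / (ZERO_EXEC_RUNTIME - ONE_EXEC_RUNTIME)
```

Under this toy model the zero-execution method stays cheaper until the small per-run gap has repaid the cost of the skipped default run. The break-even point depends entirely on the assumed runtimes, so it should not be read as reproducing the 140-run figure reported in the abstract.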

Paper Structure

This paper contains 21 sections, 5 figures, 6 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of our zero-execution method for retrieval-augmented tuning of Spark configuration parameters. A query is first processed by the Spark SQL module to generate a logical plan. ZEST embeds the logical plan and retrieves the top-$k$ most similar historical embeddings with their corresponding tuned configurations from a vector database. ZEST returns the mean of retrieved configurations to Spark which then initializes a new application to execute the job.
  • Figure 2: Distribution of configuration values for the optimal parameters for a query in the index on the EMR cluster. The x-axis denotes the values of the configuration parameters and the y-axis denotes the frequency of those parameter values in the optimal configuration across all queries and input data sizes.
  • Figure 3: Accumulated cost of query executions on the TPC-H and TPC-DS benchmarks under different tuning algorithms on the EMR cluster. The x-axis does not count the initial run with default parameters used by methods other than ZEST.
  • Figure 4: Tuning queries on an unseen data catalog. The methods were built solely on the TPC-DS data catalog and queries, and tested on the TPC-H data catalog and queries on the EMR cluster.
  • Figure 5: Template for TPC-H Query 21 from the test set and the corresponding most similar item in the retrieval index, TPC-H Query 2. Highlighted text indicates common query operations, while coloured text indicates the common columns on which they were executed.
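
The retrieve-then-average step described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are made-up two-dimensional vectors, cosine similarity is an assumed choice of metric, a plain list stands in for the vector database, and the parameter values are hypothetical. The two parameter names are real Spark settings, but the source does not say which parameters ZEST tunes.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Config = Dict[str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (assumed metric for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_mean_config(
    query_embedding: Vector,
    index: List[Tuple[Vector, Config]],
    k: int = 3,
) -> Config:
    """Average the tuned configurations of the k most similar indexed plans."""
    if not index or k < 1:
        raise ValueError("index must be non-empty and k must be at least 1")
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    top = ranked[:k]
    keys = top[0][1].keys()
    return {key: sum(cfg[key] for _, cfg in top) / len(top) for key in keys}


# Hypothetical index: (plan embedding, tuned configuration) pairs.
EXAMPLE_INDEX: List[Tuple[Vector, Config]] = [
    ([1.0, 0.0], {"spark.executor.cores": 2.0, "spark.sql.shuffle.partitions": 100.0}),
    ([0.9, 0.1], {"spark.executor.cores": 4.0, "spark.sql.shuffle.partitions": 200.0}),
    ([0.0, 1.0], {"spark.executor.cores": 8.0, "spark.sql.shuffle.partitions": 800.0}),
]

predicted = retrieve_mean_config([1.0, 0.05], EXAMPLE_INDEX, k=2)
```

Because the mean of integer-valued parameters is generally not an integer, a real system would need to round or clip the averaged values to valid settings before passing them to Spark. How ZEST handles this is not stated in this section.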