A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning

Chenghao Lyu, Qi Fan, Philippe Guyard, Yanlei Diao

TL;DR

This paper proposes a novel hybrid compile-time/runtime approach to multi-granularity tuning of diverse, correlated Spark parameters, together with a suite of modeling and optimization techniques that solve the tuning problem in the multi-objective optimization (MOO) setting while meeting the stringent time constraint of 1-2 seconds for cloud use.

Abstract

As Spark becomes a common big data analytics platform, its growing complexity makes automatic tuning of its numerous parameters critical for performance. Our work on Spark parameter tuning is motivated by two recent trends: Spark's Adaptive Query Execution (AQE), which is based on runtime statistics, and the increasingly popular Spark cloud deployments, which make cost-performance reasoning crucial for the end user. This paper presents our design of a Spark optimizer that controls all tunable parameters of each query in the new AQE architecture to explore its performance benefits and, at the same time, casts the tuning problem in the theoretically sound multi-objective optimization (MOO) setting to better adapt to user cost-performance preferences. To this end, we propose a novel hybrid compile-time/runtime approach to multi-granularity tuning of diverse, correlated Spark parameters, as well as a suite of modeling and optimization techniques to solve the tuning problem in the MOO setting while meeting the stringent time constraint of 1-2 seconds for cloud use. Evaluation results using the TPC-H and TPC-DS benchmarks demonstrate the superior performance of our approach: (i) when prioritizing latency, it achieves 63% and 65% latency reduction for TPC-H and TPC-DS, respectively, with an average solving time of 0.7-0.8 sec, outperforming the most competitive MOO method, which reduces latency by only 18-25% with a solving time of 2.6-15 sec; (ii) when shifting preferences between latency and cost, our approach dominates the solutions of alternative methods, exhibiting superior adaptability to varying preferences.
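To make "shifting preferences between latency and cost" concrete, here is a minimal sketch of one common way to pick a single configuration from a set of Pareto-optimal (latency, cost) points: normalize each objective over the set, then minimize a weighted sum. This is a generic illustration, not the selection rule used in the paper; the function name `pick_by_preference` and the sample numbers are hypothetical.

```python
def pick_by_preference(front, w_latency, w_cost):
    """Pick the (latency, cost) point with the smallest normalized weighted sum.

    `front` is a list of (latency, cost) tuples, both to be minimized.
    Weighted-sum scalarization is an assumption made for illustration only.
    """
    lat_min = min(p[0] for p in front)
    lat_span = (max(p[0] for p in front) - lat_min) or 1.0  # avoid division by zero
    cost_min = min(p[1] for p in front)
    cost_span = (max(p[1] for p in front) - cost_min) or 1.0

    def score(p):
        norm_latency = (p[0] - lat_min) / lat_span
        norm_cost = (p[1] - cost_min) / cost_span
        return w_latency * norm_latency + w_cost * norm_cost

    return min(front, key=score)


# Hypothetical Pareto front: (latency in seconds, cost in arbitrary units).
sample_front = [(10.0, 5.0), (20.0, 3.0), (40.0, 2.0)]

latency_first = pick_by_preference(sample_front, 0.9, 0.1)
balanced = pick_by_preference(sample_front, 0.5, 0.5)
cost_first = pick_by_preference(sample_front, 0.1, 0.9)
print(latency_first, balanced, cost_first)
```

Changing only the weights moves the choice along the front, which is why a well-spread Pareto set matters when user preferences vary.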

Paper Structure

This paper contains 55 sections, 8 theorems, 5 equations, 33 figures, 8 tables, and 4 algorithms.

Key Result

Proposition 5.1

Under any specific value $\bm{\theta_{c}}^j$, only subQ-level Pareto optimal solutions $(\bm{\theta_{c}}^j,{\bm{\theta_{p}}^*})$ contribute to the query-level Pareto optimal solutions.
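The pruning idea behind Proposition 5.1 can be checked on a toy example. The sketch below assumes, for illustration only, that query-level latency and cost are the sums of the subQ-level values under a fixed $\bm{\theta_{c}}^j$; the aggregation used in the paper may differ. The helper names and the sample points are hypothetical.

```python
from itertools import product


def pareto_front(points):
    """Return the non-dominated points (all objectives minimized), sorted and deduplicated."""
    front = []
    for p in points:
        # q dominates p if q is no worse in every objective and differs from p.
        dominated = any(
            q != p and all(q[i] <= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def combine(subq_sets):
    """Sum objectives across subQs for every combination of per-subQ choices.

    Additive aggregation is an assumption made for this illustration.
    """
    return [tuple(map(sum, zip(*combo))) for combo in product(*subq_sets)]


# Hypothetical (latency, cost) candidates per subQ under one fixed theta_c;
# each point stands for a different theta_p choice.
subq1 = [(4, 1), (2, 2), (3, 3), (1, 4)]  # (3, 3) is dominated by (2, 2)
subq2 = [(5, 1), (3, 2), (4, 4), (2, 5)]  # (4, 4) is dominated by (3, 2)

# Query-level front from all combinations vs. from subQ-level Pareto points only.
full = pareto_front(combine([subq1, subq2]))
pruned = pareto_front(combine([pareto_front(subq1), pareto_front(subq2)]))

print(full == pruned)
```

Under this additive assumption, any combination that uses a dominated subQ point is itself dominated by the combination that swaps in the dominating point, so restricting the search to subQ-level Pareto points loses no query-level Pareto solution while shrinking the number of combinations to enumerate.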

Figures (33)

  • Figure 1: Spark parameters provide mixed control through query compilation and execution
  • Figure 2: Query life cycle with an optimizer for parameter tuning
  • Figure 3: Profiling TPCH-Q9 (12 subQs) over different configurations
  • Figure 4: MOO solutions for TPCH Q2
  • Figure 5: CDF of analytical latency over actual latency
  • ...and 28 more figures

Theorems & Definitions (12)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 5.1
  • Proposition 5.1
  • Proposition 5.2
  • Proposition 5.3
  • Proposition A.1
  • Lemma 1
  • Proposition A.2
  • ...and 2 more