MLKAPS: Machine Learning and Adaptive Sampling for HPC Kernel Auto-tuning

Mathys Jam, Eric Petit, Pablo de Oliveira Castro, David Defour, Greg Henry, William Jalby

TL;DR

MLKAPS auto-tunes HPC kernels across large input and design spaces by learning decision trees that map runtime input parameters to optimized design configurations. It introduces GA-Adaptive sampling, alongside space-filling and HVS strategies, to efficiently build a global surrogate model (gradient-boosted decision trees, GBDT), and then generates runtime decision trees that select kernel configurations. On Intel MKL dgetrf and dgeqrf, MLKAPS reaches geomean speedups of x1.30 and x1.18 and exposes blind spots in expert hand-tuned configurations; experiments on ScaLAPACK show that it scales to large spaces, where it outperforms Optuna and GPTune. The tool is open source and emits the decision trees as C code that can be embedded directly in kernels.

Abstract

Many High-Performance Computing (HPC) libraries rely on decision trees to select the best kernel hyperparameters at runtime, depending on the input and environment. However, finding optimized configurations for each input and environment is challenging and requires significant manual effort and computational resources. This paper presents MLKAPS, a tool that automates this task using machine learning and adaptive sampling techniques. MLKAPS generates decision trees that tune HPC kernels' design parameters to achieve efficient performance for any user input. MLKAPS scales to large input and design spaces, outperforming similar state-of-the-art auto-tuning tools in tuning time and mean speedup. We demonstrate the benefits of MLKAPS on the highly optimized Intel MKL dgetrf LU kernel and show that MLKAPS finds blind spots in the manual tuning of HPC experts. It improves over 85% of the inputs with a geomean speedup of x1.30. On the Intel MKL dgeqrf QR kernel, MLKAPS improves performance on 85% of the inputs with a geomean speedup of x1.18.
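
The two headline metrics in the abstract, the geomean speedup and the share of inputs improved, can be computed from paired timings. The sketch below is only an illustration of those two definitions: the function names and the sample timings are hypothetical and are not taken from the paper or from the MLKAPS code base.

```python
import math


def geomean_speedup(baseline_times, tuned_times):
    """Geometric mean of the per-input speedups baseline / tuned."""
    if len(baseline_times) != len(tuned_times) or not baseline_times:
        raise ValueError("need two non-empty timing lists of equal length")
    speedups = [b / t for b, t in zip(baseline_times, tuned_times)]
    # Average in log space, then exponentiate: exp(mean(log(s_i))).
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def fraction_improved(baseline_times, tuned_times):
    """Share of inputs for which the tuned configuration is strictly faster."""
    improved = sum(1 for b, t in zip(baseline_times, tuned_times) if t < b)
    return improved / len(baseline_times)


# Hypothetical timings in seconds for four inputs (illustration only).
baseline = [2.0, 4.0, 1.0, 8.0]
tuned = [1.0, 2.0, 2.0, 4.0]
print(geomean_speedup(baseline, tuned), fraction_improved(baseline, tuned))
```

The geometric mean is the usual choice for averaging ratios, because a x2 speedup on one input and a x0.5 slowdown on another cancel out to x1 instead of averaging to x1.25.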

Paper Structure (31 sections, 13 figures, 1 table)

This paper contains 31 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Illustrative kernel summing the elements of a matrix with three input parameters (matrix, n, m), and one design parameter (T).
  • Figure 2: MLKAPS generates a tuning decision tree to select the best number-of-threads T for a given input context.
  • Figure 3: Flowchart of the MLKAPS auto-tuning pipeline. This example illustrates the different steps to build a decision tree that maps one design parameter $Y$ to one input parameter $X$. For an industrial application with a much larger number of parameters, it would be intractable to map the objective function exhaustively as shown here.
  • Figure 4: GA-Adaptive core loop. This algorithm uses a strategy similar to epsilon-decreasing [epsilon_decreasing_bianchi] to solve the exploration-exploitation dilemma by linearly increasing the number of samples taken for exploitation. Note how the sampling-modeling-optimization steps are similar to the global pipeline. At each iteration, the share of points taken by the GA, with the remainder left to the sub-sampler, is computed as a linear interpolation between the initial and final ratios, controlled by the completion percentage (e.g., at $50\%$ completion with an initial ratio of $0$ and a final ratio of $0.8$, GA will pick $40\%$ of points in the next iteration).
  • Figure 5: Hardware architectures used for the experiments.
  • ...and 8 more figures
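
The batch split described in the Figure 4 caption can be sketched as follows. This is a minimal illustration of the linear interpolation only, not the MLKAPS implementation: the names `ga_ratio` and `split_batch`, the default ratios, and the rounding rule are assumptions made for this example.

```python
def ga_ratio(completion, initial_ratio=0.0, final_ratio=0.8):
    """Share of the next batch picked by the GA (exploitation).

    `completion` is the fraction of the sampling budget already spent,
    in [0, 1]. The ratio moves linearly from `initial_ratio` to
    `final_ratio` as completion goes from 0 to 1.
    """
    if not 0.0 <= completion <= 1.0:
        raise ValueError("completion must be in [0, 1]")
    return initial_ratio + (final_ratio - initial_ratio) * completion


def split_batch(batch_size, completion, initial_ratio=0.0, final_ratio=0.8):
    """Return (points for the GA, points for the sub-sampler).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    n_ga = round(batch_size * ga_ratio(completion, initial_ratio, final_ratio))
    return n_ga, batch_size - n_ga


# Caption example: 50% completion, ratios 0 -> 0.8, batch of 100 points.
print(split_batch(100, 0.5))
```

With the caption's values, the GA receives 40 of the 100 points and the sub-sampler the remaining 60, so exploration dominates early and exploitation takes over as the budget is consumed.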