Transfer-Learning-Based Autotuning Using Gaussian Copula

Thomas Randall; Jaehoon Koo; Brice Videau; Michael Kruse; Xingfu Wu; Paul Hovland; Mary Hall; Rong Ge; Prasanna Balaprakash

TL;DR

The paper tackles the high cost of autotuning on heterogeneous HPC systems by introducing a generative transfer-learning approach that uses a Gaussian copula (GC) to model the high-performing regions of the search space across tasks. It formalizes a two-phase pipeline (training on source tasks, then conditional sampling for a target task) and augments the GC with quantile filtering and a budget-estimation mechanism to make few-shot tuning effective. On Polybench and Exascale mini-app benchmarks, the GC reaches 64.37% of peak few-shot performance in its first evaluation and, within its estimated budget, speedups of up to $33.39\times$, compared with $20.58\times$ for prior techniques. The result is data-efficient autotuning that adapts to new task configurations at low empirical cost.

Abstract

As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39$\times$ speedup, a dramatic improvement over the 20.58$\times$ speedup using prior techniques.
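The generative idea in the abstract (model the high-performing region of prior data, then sample new configurations from it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rank-based marginals, the two tuning parameters, and the synthetic runtimes are all assumptions made for illustration, and conditional sampling on the target task is omitted to keep the example short.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def quantile_filter(configs, runtimes, q):
    """Keep the fraction q of configurations with the lowest runtimes."""
    keep = max(2, int(len(configs) * q))
    order = sorted(range(len(configs)), key=lambda i: runtimes[i])
    return [configs[i] for i in order[:keep]]


def to_normal_scores(column):
    """Map a column to normal scores through its ranks (ties broken arbitrarily)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def correlation(score_cols):
    """Pearson correlation matrix of the normal-score columns."""
    d, n = len(score_cols), len(score_cols[0])
    means = [sum(c) / n for c in score_cols]
    cov = [[sum((score_cols[a][k] - means[a]) * (score_cols[b][k] - means[b])
                for k in range(n)) / n for b in range(d)] for a in range(d)]
    return [[cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(d)]
            for a in range(d)]


def cholesky(matrix, jitter=1e-9):
    """Lower-triangular Cholesky factor; jitter guards against singular input."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(matrix[i][i] - s, 0.0) + jitter)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def propose(configs, runtimes, n_samples, q=0.2, seed=0):
    """Fit a Gaussian copula to the filtered data and sample new configurations."""
    rng = random.Random(seed)
    top = quantile_filter(configs, runtimes, q)
    cols = [list(c) for c in zip(*top)]
    chol = cholesky(correlation([to_normal_scores(c) for c in cols]))
    sorted_cols = [sorted(c) for c in cols]
    proposals = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in cols]
        # Correlate the independent normals, then map each to a uniform quantile.
        correlated = [sum(chol[i][k] * z[k] for k in range(i + 1))
                      for i in range(len(cols))]
        u = [STD_NORMAL.cdf(v) for v in correlated]
        # Inverse empirical CDF: return the observed value at that quantile.
        proposals.append(tuple(sc[min(int(p * len(sc)), len(sc) - 1)]
                               for sc, p in zip(sorted_cols, u)))
    return proposals


# Made-up source data: (tile_size, unroll_factor) with a synthetic runtime.
data_rng = random.Random(1)
source_configs = [(data_rng.choice([8, 16, 32, 64, 128]),
                   data_rng.choice([1, 2, 4, 8])) for _ in range(200)]
source_runtimes = [abs(t - 32) / 32 + abs(u - 4) / 4 + 0.1 * data_rng.random()
                   for t, u in source_configs]
print(propose(source_configs, source_runtimes, n_samples=5))
```

Because only the filtered rows define the marginals and the correlation, every proposal is drawn from values seen among the best source configurations. No performance model is fit at any point, which is the contrast the abstract draws with prior TL methods.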

Paper Structure (31 sections, 1 equation, 5 figures, 7 tables)

This paper contains 31 sections, 1 equation, 5 figures, 7 tables.

Introduction
Background
Autotuning
Transfer Learning in Autotuning
Gaussian Copula
Proposed Framework
Model Training
GC for Autotuning
Variable Preprocessing
GC as an Autotuner
GC Model Fitting for Few-Shot Tuning
Quantile Filtering
Model Inference
Conditional Sampling
Advantages over Alternative Generative Models
...and 16 more sections

Figures (5)

Figure 1: TL-based Autotuning Framework Using GC. TOP: Model Training, which uses GC to train fitted models with data collected from source tasks (multiple input sizes of an application) in a human-designed tuning space. BOTTOM: Model Inference, which uses the fitted GC models to propose high-performing configurations for new tasks and evaluates them.
Figure 2: Observed speedup vs. log-scale elapsed time for few-shot TL autotuning. The dotted lines indicate results trimmed to the GC's predicted budget.
Figure 3: Ambiguous responses to tuning yield minimal speedup, but the GC remains competitive with prior work.
Figure 4: Brute-forcing the Syr2k XL task proves that the GC and GPTune can identify the global optimum in 30 evaluations, but the GC avoids poor evaluations, giving it better average performance.
Figure 5: The GC remains competitive with state-of-the-art techniques on complex ECP benchmarks.
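Figure 2 refers to the GC's predicted budget, which the abstract describes as a probabilistic estimate of the number of few-shot evaluations. The paper's own estimator is not reproduced in this summary. As a rough illustration of what such an estimate can look like, the sketch below assumes independent draws that each land in the desired region with a known probability; both the independence assumption and the function name `few_shot_budget` are hypothetical.

```python
import math


def few_shot_budget(hit_probability, confidence=0.95):
    """Smallest k such that 1 - (1 - p)^k >= confidence for independent draws."""
    if not 0.0 < hit_probability <= 1.0:
        raise ValueError("hit_probability must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if hit_probability == 1.0:
        return 1  # every draw succeeds, so one evaluation suffices
    k = math.log(1.0 - confidence) / math.log(1.0 - hit_probability)
    return max(1, math.ceil(k))


# If 10% of sampled configurations fall in the desired region (assumed value),
# this is the number of evaluations needed for a 95% chance of at least one hit.
print(few_shot_budget(0.10, confidence=0.95))
```

Under these assumptions, a sampler that concentrates more mass on the high-performing region raises the per-draw hit probability and so lowers the required budget.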
