Table of Contents
Fetching ...

Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality

Adrian Perez Dieguez, Min Choi, Mahmut Okyay, Mauro Del Ben, Bryan M. Wong, Khaled Z. Ibrahim

TL;DR

The methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches, and reduces the search time by up to 95%; compared to full-independent searches, it leads to final configurations up to 8% more accurate.

Abstract

Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches for each routine, thereby overlooking interdependence, or pursuing a more resource-intensive joint search for all routines. This decision is driven by the consideration that some interdependence analysis and high-dimensional decomposition techniques in literature may be prohibitively expensive in HPC tuning searches. Our methodology adapts and refines these methods to ensure computational feasibility while maximizing performance gains in real-world scenarios. Our methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches. Tested on synthetic functions with varying levels of parameter interdependence, our methodology efficiently explores the search space. In comparison to Bayesian-optimization-based full independent or fully joint searches, our methodology suggested an optimized breakdown of independent and merged searches that led to final configurations up to 8% more accurate, reducing the search time by up to 95%. When applied to GPU-offloaded Real-Time Time-Dependent Density Functional Theory (RT-TDDFT), an application in computational materials science that challenges modern HPC autotuners, our methodology achieved an effective tuning search. Its adaptability and efficiency extend beyond RT-TDDFT, making it valuable for related applications in HPC.

Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality

TL;DR

The methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches, and reduces the search time by up to 95%; compared to full-independent searches, it leads to final configurations up to 8% more accurate.

Abstract

Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches for each routine, thereby overlooking interdependence, or pursuing a more resource-intensive joint search for all routines. This decision is driven by the consideration that some interdependence analysis and high-dimensional decomposition techniques in literature may be prohibitively expensive in HPC tuning searches. Our methodology adapts and refines these methods to ensure computational feasibility while maximizing performance gains in real-world scenarios. Our methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches. Tested on synthetic functions with varying levels of parameter interdependence, our methodology efficiently explores the search space. In comparison to Bayesian-optimization-based full independent or fully joint searches, our methodology suggested an optimized breakdown of independent and merged searches that led to final configurations up to 8% more accurate, reducing the search time by up to 95%. When applied to GPU-offloaded Real-Time Time-Dependent Density Functional Theory (RT-TDDFT), an application in computational materials science that challenges modern HPC autotuners, our methodology achieved an effective tuning search. Its adaptability and efficiency extend beyond RT-TDDFT, making it valuable for related applications in HPC.
Paper Structure (17 sections, 1 equation, 6 figures, 7 tables)

This paper contains 17 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Synthetic 20-dim function body where $A_i=10\cdot cos(2\pi\cdot(x_i -1))+\epsilon$, with $\epsilon$ representing random noise. The template box outlined with Group 3 is replaced with a set of implementations to create different synthetic functions.
  • Figure 2: The DAG diagram illustrates the interdependencies between variables and groups during the tuning process for synthetic Case 3 (medium influence), after applying a 25% cut-off.
  • Figure 3: Mapping of the wavefunction computation (left) into MPI tasks (right). The original CPU MPI partition is on the top, while the GPU partition is on the bottom. The spin dimension is 1 to enable a 3D representation.
  • Figure 4: Pseudo-code for the QBox-based RT-TDDFT that summarizes its dominant computational pattern.
  • Figure 5: Diagram of the resulting dependencies on searches
  • ...and 1 more figures