ML-Based Optimum Sub-system Size Heuristic for the GPU Implementation of the Tridiagonal Partition Method

Milena Veneva

TL;DR

The paper tackles the problem of auto-tuning the optimum sub-system size $m$ in the CUDA-based parallel partition method for SLAEs with tridiagonal coefficient matrices. It leverages a data-driven approach using $k$NN (with $k=1$) to map SLAE size $N$ to the best $m$, and extends the framework to predict the number of recursive steps in a recursive partition variant. Empirical studies across GPUs and precisions show that a corrected dataset enables $1$-NN to reach $100\%$ normalised accuracy in predicting $m$, with speed-ups up to $1.7\times$ for FP64 and $1.17\times$ for recursion; FP32 requires a separate model due to distinct behavior. The results demonstrate a practical auto-tuning pathway for GPU HPC kernels, reducing manual tuning effort and enabling cross-hardware applicability while highlighting memory-alignment constraints and precision-dependent differences.

Abstract

This paper presents a machine learning (ML)-based heuristic for finding the optimum sub-system size for the CUDA implementation of the parallel partition algorithm. Computational experiments are conducted for systems of linear algebraic equations (SLAEs) of different sizes, and the optimum sub-system size for each of them is found empirically. To build a model for the sub-system size, we apply the k-nearest neighbors (kNN) classification method. A statistical analysis of the results is carried out. Comparing the predicted values with the actual data shows that the heuristic is acceptably good. Next, the heuristic is extended to the recursive parallel partition algorithm. An algorithm for determining the optimum sub-system size for each recursive step is formulated, and a kNN model for predicting the optimum number of recursive steps for a particular SLAE size is built.
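
To make the idea of mapping an SLAE size to a sub-system size concrete, here is a minimal sketch of a nearest-neighbour lookup with $k=1$. It is an illustration only, not the paper's implementation: the function name `predict_m`, the use of the absolute difference in $N$ as the distance, the tie-breaking rule, and all training pairs are assumptions, and the numbers are placeholders rather than measured values.

```python
from bisect import bisect_left

# Hypothetical (N, m) pairs: SLAE size and the sub-system size assumed to be
# best for it. Placeholder values for illustration, not data from the paper.
TRAINING = [
    (1_000, 4),
    (10_000, 8),
    (100_000, 16),
    (1_000_000, 32),
    (10_000_000, 64),
]


def predict_m(n, training=TRAINING):
    """Return the m of the training point whose N is closest to n (1-NN).

    Distance is |N - n| (an assumption); on an exact tie the smaller N wins.
    """
    if not training:
        raise ValueError("training set must not be empty")
    points = sorted(training)
    sizes = [size for size, _ in points]
    # Only the two points bracketing n can be the nearest neighbour.
    i = bisect_left(sizes, n)
    candidates = points[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda p: abs(p[0] - n))
    return nearest[1]


if __name__ == "__main__":
    for n in (500, 50_000, 60_000, 20_000_000):
        print(n, "->", predict_m(n))
```

Because the feature is one-dimensional, sorting the training sizes once and bisecting keeps each query cheap; a library classifier could be substituted where more features (for example GPU model or precision) are added.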

Paper Structure

This paper contains 18 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison between the achieved and the theoretical occupancy.
  • Figure 2: Results from the kNN classification model for the optimum sub-system size.
  • Figure 3: Operations of the non-recursive partition method (top), and the recursive partition method with one recursive step (bottom).
  • Figure 4: Comparison between the times for the partition method with different number of recursions.
  • Figure 5: Results from the kNN classification model for the optimum number of recursive steps.
  • ...and 1 more figures