ML-Based Optimum Sub-system Size Heuristic for the GPU Implementation of the Tridiagonal Partition Method

Milena Veneva

TL;DR

The paper tackles the problem of auto-tuning the optimum sub-system size $m$ in the CUDA-based parallel partition method for systems of linear algebraic equations (SLAEs) with tridiagonal coefficient matrices. It takes a data-driven approach, using $k$NN (with $k=1$) to map the SLAE size $N$ to the best $m$, and extends the framework to predict the number of recursive steps in a recursive variant of the partition method. Empirical studies across GPUs and precisions show that a corrected dataset enables $1$-NN to reach $100\%$ normalised accuracy in predicting $m$, with speed-ups of up to $1.7\times$ for FP64 and $1.17\times$ for recursion; FP32 requires a separate model because it behaves differently. The results demonstrate a practical auto-tuning pathway for GPU HPC kernels that reduces manual tuning effort and applies across hardware, while highlighting memory-alignment constraints and precision-dependent differences.

Abstract

This paper presents a machine learning (ML)-based heuristic for finding the optimum sub-system size for the CUDA implementation of the parallel partition algorithm. Computational experiments are conducted for systems of linear algebraic equations (SLAEs) of different sizes, and the optimum sub-system size for each of them is found empirically. To build a model for the sub-system size, we apply the k-nearest neighbors (kNN) classification method. A statistical analysis of the results is carried out. Comparing the predicted values with the actual data shows that the algorithm is acceptably good. Next, the heuristic is extended to work for the recursive parallel partition algorithm as well. An algorithm for determining the optimum sub-system size for each recursive step is formulated. A kNN model for predicting the optimum number of recursive steps for a particular SLAE size is built.
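To make the kNN idea concrete, the following is a minimal sketch of a 1-nearest-neighbour lookup from an SLAE size to a sub-system size. It is not the paper's implementation: the function name `predict_m`, the `TRAINING` table, and every value in it are hypothetical placeholders rather than measured optima, and the plain absolute difference between sizes is an assumed distance metric.

```python
from bisect import bisect_left

# Hypothetical (SLAE size N, optimum sub-system size m) pairs, sorted by N.
# Illustrative placeholders only; these are NOT measurements from the paper.
TRAINING = [(1_000, 4), (10_000, 8), (100_000, 16), (1_000_000, 32)]


def predict_m(n, training=TRAINING):
    """Return the m of the training SLAE size nearest to n (1-NN).

    Assumes `training` is sorted by size in ascending order and that
    distance is the absolute difference |n - N_i| (an assumption).
    """
    sizes = [size for size, _ in training]
    i = bisect_left(sizes, n)
    # Only the neighbours just below and just above n can be the nearest.
    candidates = training[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda pair: abs(pair[0] - n))
    return nearest[1]


if __name__ == "__main__":
    for n in (500, 12_000, 60_000, 2_000_000):
        print(n, predict_m(n))
```

With $k=1$, a prediction simply copies the label of the closest measured size, so the quality of the model depends directly on how densely and how accurately the training sizes were measured. On an exact tie this sketch picks the smaller of the two sizes.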
