Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Yufan Xia; Giuseppe Maria Junior Barca

TL;DR

This work tackles the challenge of optimally selecting the number of threads for BLAS Level 3 routines on modern multi-core systems. It extends the ADSALA framework to cover all L3 subroutines by learning per-subroutine, architecture-specific thread-prediction models trained during installation and used at runtime. The approach achieves substantial performance gains (approximately $1.5\times$ to $3.0\times$ speedups) on two HPC platforms (Setonix and Gadi) compared with using the maximum available threads, and it analyzes the sources of speedup through profiling and runtime patterns. The results demonstrate the generality of the ML-driven runtime optimization for BLAS L3 operations and point to future directions including broader hardware support and potential GPU integration.

Abstract

BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of $1.5\times$ to $3.0\times$ for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.
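The thread-selection idea in the abstract can be illustrated with a minimal sketch. It assumes one plausible mechanism that the abstract does not spell out: a per-subroutine model estimates the runtime for each candidate thread count, and the count with the lowest estimate is chosen. The names `toy_runtime_model` and `select_threads`, and the constants inside the toy model, are hypothetical placeholders and not part of ADSALA.

```python
from typing import Callable, Iterable, Tuple


def toy_runtime_model(m: int, n: int, k: int, threads: int) -> float:
    """Hypothetical stand-in for a trained model (not ADSALA's).

    The compute term shrinks as threads are added, while a
    synchronization overhead term grows with the thread count.
    """
    flops = 2.0 * m * n * k                # GEMM-like operation count
    compute = flops / (threads * 1.0e9)    # assumed 1 GFLOP/s per thread
    overhead = 5.0e-6 * threads            # assumed per-thread overhead (s)
    return compute + overhead


def select_threads(
    model: Callable[[int, int, int, int], float],
    dims: Tuple[int, int, int],
    candidates: Iterable[int],
) -> int:
    """Return the candidate thread count with the lowest predicted runtime."""
    m, n, k = dims
    return min(candidates, key=lambda t: model(m, n, k, t))


if __name__ == "__main__":
    for dims in [(8, 8, 8), (100, 100, 100), (4000, 4000, 4000)]:
        best = select_threads(toy_runtime_model, dims, range(1, 65))
        print(dims, "->", best, "threads")
```

With this toy model, small problems favor few threads and large problems favor the maximum, which mirrors the motivation for not always using every available core.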

Paper Structure (23 sections, 1 equation, 7 figures, 8 tables)

Introduction
Background
BLAS Level III Subroutines
Machine Learning Algorithms
Data Preprocessing Techniques
Software Workflow
Installation Workflow
Runtime Workflow
Machine Learning Methods
Mechanism for Predictions
Data Gathering
Feature Engineering and Data Preprocessing
Model Selection
Experimentation Information
Experimentation Platforms
...and 8 more sections

Figures (7)

Figure 1: The software design of ADSALA.
Figure 2: A schematic diagram for the 2-socket EPYC CPU configuration on Setonix.
Figure 3: A schematic diagram of the 2-socket Cascade Lake CPU configuration; sockets are connected using Intel$^{\circledR}$ UPI (Ultra Path Interconnect).
Figure 4: Heatmap of the optimal number of threads on Setonix and Gadi for all BLAS Level III subroutines except GEMM. The horizontal and vertical axes use a square root scale.
Figure 5: Heatmap of the optimal number of threads on Setonix and Gadi. The horizontal and vertical axes use a square root scale. The dashed lines on each sub-graph are contour lines of the sampling domain with each label showing the value of the third dimension.
...and 2 more figures
