Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM

Vicki Carrica, Maxwell Onyango, Rabab Alomairy, Evelyne Ringoot, James Schloss, Alan Edelman

TL;DR

This work addresses the challenge of delivering high-performance TRMM and TRSM on GPUs across heterogeneous hardware with a portable API. It introduces a recursive, GEMM-centric formulation implemented in Julia using GPUArrays.jl and KernelAbstractions.jl, producing hardware-agnostic kernels that run on NVIDIA, AMD, and Apple Silicon GPUs. The approach achieves throughput comparable to cuBLAS/rocBLAS for large matrices while maintaining a compact codebase of a few hundred lines, and it marks the first Apple Silicon support for these routines. The results demonstrate the feasibility of performance portability in dense linear algebra, enabling scalable, hardware-diverse deployment without vendor-specific abstractions, and point toward broader extensions to other triangular operations and multi-core environments.

Abstract

This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
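The restructuring described above can be illustrated with a small CPU-only sketch. It is written in plain Python rather than the authors' Julia/GPU code, and it covers only one assumed case: a lower-triangular, left-sided, non-transposed $A$, with $B$ overwritten in place. The function names (`gemm_update`, `trsm_recursive`, `trmm_recursive`), the `threshold` parameter, and the simple substitution base case are hypothetical choices for illustration, not the paper's API. The idea is to split $A$ into blocks $A_{11}$, $A_{21}$, $A_{22}$, recurse on the two triangular diagonal blocks, and handle the off-diagonal block $A_{21}$ with a GEMM-shaped update.

```python
def gemm_update(A, B, rows, cols, alpha):
    """B[rows, :] += alpha * A[rows, cols] @ B[cols, :].

    This is the GEMM-shaped step. `rows` and `cols` are disjoint
    (start, stop) ranges, so the rows of B being read are not modified.
    """
    ncols = len(B[0])
    for i in range(*rows):
        for k in range(*cols):
            a = alpha * A[i][k]
            for j in range(ncols):
                B[i][j] += a * B[k][j]


def trsm_recursive(A, B, lo=0, hi=None, threshold=2):
    """Solve A[lo:hi, lo:hi] X = B[lo:hi, :] in place (A lower triangular)."""
    if hi is None:
        hi = len(A)
    if hi - lo <= max(threshold, 1):
        # Assumed base case: forward substitution on a small diagonal block.
        for i in range(lo, hi):
            for j in range(len(B[0])):
                s = B[i][j]
                for k in range(lo, i):
                    s -= A[i][k] * B[k][j]
                B[i][j] = s / A[i][i]
        return
    mid = (lo + hi) // 2
    trsm_recursive(A, B, lo, mid, threshold)       # X1 = A11^{-1} B1
    gemm_update(A, B, (mid, hi), (lo, mid), -1.0)  # B2 -= A21 X1
    trsm_recursive(A, B, mid, hi, threshold)       # X2 = A22^{-1} B2


def trmm_recursive(A, B, lo=0, hi=None, threshold=2):
    """Overwrite B[lo:hi, :] with A[lo:hi, lo:hi] @ B[lo:hi, :]."""
    if hi is None:
        hi = len(A)
    if hi - lo <= max(threshold, 1):
        # Assumed base case: go bottom-up so rows above i are still unmodified.
        for i in reversed(range(lo, hi)):
            for j in range(len(B[0])):
                B[i][j] = sum(A[i][k] * B[k][j] for k in range(lo, i + 1))
        return
    mid = (lo + hi) // 2
    trmm_recursive(A, B, mid, hi, threshold)       # B2 = A22 B2
    gemm_update(A, B, (mid, hi), (lo, mid), 1.0)   # B2 += A21 B1 (old B1)
    trmm_recursive(A, B, lo, mid, threshold)       # B1 = A11 B1
```

In this sketch the two recursive calls touch only the diagonal blocks, so as the matrix grows most of the arithmetic lands in `gemm_update`. On a GPU that step would presumably map to a tuned GEMM routine; the triple loop here is only a stand-in.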

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures.

Figures (3)

  • Figure 1: TRMM/TRSM Recursive Illustration.
  • Figure 2: Runtime of the recursive unified TRMM (top row) and TRSM (bottom row) functions on different GPU hardware platforms (Apple, AMD, NVIDIA) as a function of the size of the matrix $A \in \mathbb{R}^{n \times n}$, for a rectangular matrix $B \in \mathbb{R}^{n \times 256}$, both in single precision. The performance trend is similar on all three hardware setups.
  • Figure 3: Runtime ratio of cuBLAS/rocBLAS versus the Julia implementation of TRMM (top row) and TRSM (bottom row) as a function of the size of the matrix $A \in \mathbb{R}^{n \times n}$. Higher values indicate that the Julia implementation is faster; 100% indicates equal performance. The left two panels are for a matrix $B \in \mathbb{R}^{n \times 256}$ of fixed width. The right two panels show the case of a square matrix $B \in \mathbb{R}^{n \times n}$. The figure demonstrates that the unified implementation generally performs on par with the state-of-the-art, vendor-optimized cuBLAS/rocBLAS libraries.