High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia

Emmanuel Pilliat

Abstract

Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations. As a proof of concept, these results demonstrate that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.
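For readers unfamiliar with the three primitives named in the abstract, their semantics can be sketched on the CPU with Base Julia alone. This is only a reference sketch of what scan, mapreduce, and a matrix-vector product compute for a user-defined type and operator; it does not use or depict the KernelForge.jl API, and the `MinMax` type and `combine` operator are hypothetical names introduced purely for illustration.

```julia
# Hypothetical element type: a (minimum, maximum) pair.
struct MinMax
    lo::Float32
    hi::Float32
end

# Hypothetical associative operator on MinMax values.
combine(a::MinMax, b::MinMax) = MinMax(min(a.lo, b.lo), max(a.hi, b.hi))

x = Float32[3, -1, 4, 1, -5, 9]

# mapreduce: lift each element with a map function, then fold with the operator.
total = mapreduce(v -> MinMax(v, v), combine, x)

# scan (inclusive prefix): running combination over the lifted elements.
running = accumulate(combine, map(v -> MinMax(v, v), x))

# Matrix-vector product, shown with the built-in operator for reference.
A = Float32[1 2; 3 4; 5 6]
y = A * Float32[1, 1]
```

Associativity of the operator is what allows a parallel implementation to regroup the combination steps; the CPU calls above fix only the expected results, not the execution strategy.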

Paper Structure (36 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Bandwidth (GB/s) for vectorized copy as a function of problem size. Empirical bandwidth measured via kernel timing with CUDA.@profile, shown for CUDA.jl (which internally calls libcuda) and for KernelForge.jl with 1, 4, and 8 items per thread. The dashed vertical line indicates the L2 cache size divided by $2 \times \texttt{sizeof(element)}$. Peak bandwidth is achieved with 128-bit loads, where KernelForge.jl outperforms CUDA.jl.
  • Figure 2: Vector-matrix thread organization for a wide, short matrix. The $x$-axis represents columns and the $y$-axis represents rows. Each warp (dashed yellow boundary) is assigned 4 consecutive columns; threads stride vertically across rows, with the blue and grey regions corresponding to the first and second row strides, respectively, each thread loading 4 elements per stride. Each workgroup of 128 threads (solid orange boundary) thus covers 16 columns. This layout maintains coalesced memory access while keeping all threads occupied across multiple strides.
  • Figure 3: Reduction Benchmark Across Implementations (A40). Comparison of four implementations for the parallel sum operation: CUDA.jl, AcceleratedKernels.jl, KernelForge.jl, and CUB (compiled with nvcc). Results are shown for two input sizes: $n = 10^7$ (left) and $n = 10^8$ (right), and two data types: Float32 and UInt8. Dark bars show kernel execution time; light bars include launch overhead. Error bars indicate variability across runs. Since UnitFloat8 is a Julia-specific type, the CUB benchmark uses a dummy UInt8 summation for reference.
  • Figure 4: Scan Benchmark Across Implementations (A40). Same implementations as the reduction benchmark. Each algorithm is tested with two data types: Float32 and Float64. For reference, Kokkos achieves $0.84\,\text{ms}$, $7.4\,\text{ms}$, and $73.4\,\text{ms}$ on Float64 for $n = 10^7$, $10^8$, and $10^9$ respectively, representing a $2.6\times$ overhead relative to KernelForge.jl and CUB.
  • Figure 5: Vector-Matrix Product Benchmark Across Matrix Shapes (A40). Throughput comparison between cuBLAS (via CUDA.jl) and KernelForge for Float32 vector-matrix multiplication. The total input data size $n \times p$ is fixed at $10^7$ (left) and $10^8$ (right), with varying aspect ratios to assess performance across different memory access patterns.
  • ...and 1 more figure
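The numbers quoted in the Figure 2 and Figure 4 captions can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the paper: the warp width of 32 is an assumption (the NVIDIA value, as on the A40), the traffic model of one read plus one write per element is an assumed lower bound for a scan, and the 2.6× factor is read here as a plain time ratio.

```julia
# Figure 2 layout: values taken from the caption.
threads_per_workgroup = 128
columns_per_warp      = 4
warp_size             = 32   # assumption: NVIDIA warp width

warps_per_workgroup   = threads_per_workgroup ÷ warp_size
columns_per_workgroup = warps_per_workgroup * columns_per_warp

# Figure 4: Kokkos Float64 scan timings quoted in the caption (milliseconds).
kokkos_ms = Dict(10^7 => 0.84, 10^8 => 7.4, 10^9 => 73.4)

# Assumed traffic model: each element is read once and written once.
scan_bytes(n) = 2 * n * sizeof(Float64)
bandwidth_gbs(n, ms) = scan_bytes(n) / (ms * 1e-3) / 1e9

for n in sort(collect(keys(kokkos_ms)))
    println("n = ", n, ": ", round(bandwidth_gbs(n, kokkos_ms[n]); digits = 1), " GB/s")
end

# Assumption: the quoted 2.6x overhead is a ratio of execution times.
implied_ms  = kokkos_ms[10^8] / 2.6
implied_gbs = bandwidth_gbs(10^8, implied_ms)
```

Under these assumptions the workgroup covers 16 columns, as the Figure 2 caption states, and the Kokkos timings correspond to roughly 190 to 218 GB/s of effective bandwidth. Dividing by the quoted factor would put KernelForge.jl and CUB near 560 GB/s at $n = 10^8$, but that figure is an inference from the caption, not a measurement reported here.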