Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Sreeram Venkat, Kasia Swirydowicz, Noah Wolfe, Omar Ghattas

TL;DR

This work tackles performance portability and mixed-precision acceleration for FFT-based matvecs on block-triangular Toeplitz matrices arising in Bayesian inverse problems. It employs an on-the-fly HIP translation to run CUDA-origin FFTMatvec on AMD hardware, and enhances performance by integrating a tailored conjugate-transpose SBGEMV kernel into rocBLAS. A dynamic mixed-precision framework is then used to select precision configurations via Pareto-front analysis, achieving substantial speedups (up to ~95% on some AMD GPUs and ~30% at extreme scales) while controlling error below target tolerances. The approach is validated on AMD Instinct MI250X/MI300X/MI355X and scaled to 4,096 GPUs on Frontier, illustrating significant time-to-solution gains for outer-loop tasks like optimal sensor placement, and the methodology is broadly applicable to FFT-based Hessian actions in large-scale inverse problems (e.g., with parameters $N_m$, $N_d$, $N_t$).

Abstract

The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.
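
To make the core operation concrete, the following is a minimal NumPy sketch of an FFT-based matvec with a block lower-triangular Toeplitz matrix. It is not the FFTMatvec implementation: the function names are hypothetical, and the storage layout (the first block column held as an array of shape $(N_t, N_d, N_m)$, with the input vector reshaped to $(N_t, N_m)$) is an assumption chosen for illustration. The per-frequency product in the middle plays the role of the strided batched GEMV (SBGEMV) mentioned above.

```python
import numpy as np

def fft_block_toeplitz_matvec(blocks, m):
    """Compute d_i = sum_{j<=i} blocks[i-j] @ m[j] using FFTs.

    blocks: (N_t, N_d, N_m) first block column (assumed layout).
    m:      (N_t, N_m) input vector, one row per time index.
    """
    n_t = blocks.shape[0]
    # Zero-pad to length 2*N_t so the circular convolution computed by
    # the FFT equals the linear (causal) convolution on the first N_t entries.
    f_hat = np.fft.rfft(blocks, n=2 * n_t, axis=0)  # (N_t+1, N_d, N_m)
    m_hat = np.fft.rfft(m, n=2 * n_t, axis=0)       # (N_t+1, N_m)
    # One small matvec per frequency: the batched GEMV step.
    d_hat = np.einsum("kdm,km->kd", f_hat, m_hat)   # (N_t+1, N_d)
    d = np.fft.irfft(d_hat, n=2 * n_t, axis=0)      # (2*N_t, N_d)
    return d[:n_t]                                  # drop the padding

def dense_block_toeplitz_matvec(blocks, m):
    """Reference O(N_t^2) implementation for checking the FFT version."""
    n_t, n_d, _ = blocks.shape
    d = np.zeros((n_t, n_d))
    for i in range(n_t):
        for j in range(i + 1):
            d[i] += blocks[i - j] @ m[j]
    return d

# Small random problem (sizes are arbitrary, not taken from the paper).
rng = np.random.default_rng(0)
blocks = rng.standard_normal((6, 3, 4))
m = rng.standard_normal((6, 4))
diff = fft_block_toeplitz_matvec(blocks, m) - dense_block_toeplitz_matvec(blocks, m)
print("max abs difference:", np.abs(diff).max())
```

The zero-padding embeds the block-triangular Toeplitz matrix in a block-circulant one, which the FFT diagonalizes block-wise. That is why the work reduces to independent small matvecs, one per frequency.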

Paper Structure

This paper contains 34 sections, 12 equations, 4 figures.

Figures (4)

  • Figure 1: Performance comparison of rocBLAS vs. optimized implementation of strided batched GEMV (conjugate) transpose kernel for short, wide matrices ($m \leq n$) on an AMD Instinct™ MI300X GPU. Conjugate transpose is benchmarked for complex datatypes, and regular transpose is benchmarked for real datatypes. A batch size of 100 is used for all tests. Performance is measured by memory bandwidth as determined by the rocblas-bench benchmark. Bars are annotated with the percentage of peak memory bandwidth. The optimized kernel achieves greater relative performance on more skewed rectangular matrices than on square matrices, and on lighter datatypes such as real single precision than on heavier datatypes such as double complex. See \ref{sec:perf-opt} for details on the optimized kernel implementation.
  • Figure 2: Runtime breakdown of FFTMatvec running on AMD Instinct™ MI250X (Single GCD), MI300X, and MI355X GPUs. The SBGEMV comprises the majority ($\sim$92%) of the runtime. The left bar in each cluster shows the results for the $\mathbf{F}$ matvec, and the right bar shows the results for the $\mathbf{F}^*$ matvec. For all tests, $N_m = 5{,}000, N_d=100,$ and $N_t=1{,}000$. The observed trend in performance corresponds roughly to the peak memory bandwidths of the different GPUs.
  • Figure 3: Double-precision vs. optimal mixed-precision configuration runtime breakdown of FFTMatvec ($\mathbf{F}$ matvec) running on AMD Instinct™ MI250X (Single GCD), MI300X, and MI355X GPUs. The left bar in each cluster shows the baseline double-precision matvec, and the right bar shows the results for the matvec with optimal mixed-precision configuration for a relative error tolerance threshold of $10^{-7}$. Transparency is used to indicate a single-precision computational phase, while opacity indicates double precision. For all tests, $N_m = 5{,}000, N_d=100,$ and $N_t=1{,}000$.
  • Figure 4: Speedups and relative errors of optimal mixed-precision configurations compared to the double-precision baseline when scaling from 8 to 4,096 GPUs on the Frontier supercomputer ($\mathbf{F}$ matvec only; $\mathbf{F}^*$ results are similar). Communication-aware partitioning was used to select the optimal processor grid shape for each number of GPUs. The global problem size for $p$ GPUs was set to $N_m = 5{,}000p$, $N_d = 100$, and $N_t = 1{,}000$. On 4,096 GPUs, a matvec with over 20 billion parameters ($N_m N_t$) is computed in $\sim 0.11$ s.
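
The "optimal mixed-precision configuration" in Figures 3 and 4 is the fastest configuration whose relative error stays below the chosen tolerance. A minimal sketch of that selection step follows, assuming each candidate has already been timed and its error measured. The function names, configuration labels, and numbers are hypothetical placeholders, not the paper's measurements.

```python
def pareto_front(configs):
    """Return the non-dominated (name, runtime, rel_error) tuples.

    A configuration is dominated if another one is at least as fast
    and at least as accurate. Sorting by runtime lets a single pass
    keep only entries that strictly improve the best error seen so far.
    """
    front = []
    best_err = float("inf")
    for cfg in sorted(configs, key=lambda c: (c[1], c[2])):
        if cfg[2] < best_err:
            front.append(cfg)
            best_err = cfg[2]
    return front

def pick_config(configs, tol):
    """Fastest Pareto-optimal configuration with rel_error <= tol, or None."""
    feasible = [c for c in pareto_front(configs) if c[2] <= tol]
    return min(feasible, key=lambda c: c[1]) if feasible else None

# Hypothetical candidates: (label, normalized runtime, relative error).
configs = [
    ("all-double",    1.00, 1e-15),
    ("single-sbgemv", 0.60, 3e-8),
    ("single-fft",    0.95, 5e-8),   # slower and less accurate than single-sbgemv
    ("all-single",    0.50, 2e-6),
]

print([c[0] for c in pareto_front(configs)])
print(pick_config(configs, tol=1e-7))
```

With a tolerance of $10^{-7}$, as in Figure 3, the all-single candidate is rejected for exceeding the tolerance and the dominated candidate never reaches the front, so the selection falls to the fastest remaining entry.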

Theorems & Definitions (1)

  • Remark 1