Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices
Sreeram Venkat, Kasia Swirydowicz, Noah Wolfe, Omar Ghattas
TL;DR
This work tackles performance portability and mixed-precision acceleration for FFT-based matvecs on block-triangular Toeplitz matrices arising in Bayesian inverse problems. It employs an on-the-fly HIP translation to run CUDA-origin FFTMatvec on AMD hardware, and enhances performance by integrating a tailored conjugate-transpose SBGEMV kernel into rocBLAS. A dynamic mixed-precision framework is then used to select precision configurations via Pareto-front analysis, achieving substantial speedups (up to ~95% on some AMD GPUs and ~30% at extreme scales) while controlling error below target tolerances. The approach is validated on AMD Instinct MI250X/MI300X/MI355X and scaled to 4,096 GPUs on Frontier, illustrating significant time-to-solution gains for outer-loop tasks like optimal sensor placement, and the methodology is broadly applicable to FFT-based Hessian actions in large-scale inverse problems (e.g., with parameters $N_m$, $N_d$, $N_t$).
Abstract
The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec, an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.
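The Pareto-front selection mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the FFTMatvec implementation: the `Config` type, the configuration labels, and every timing and error value below are hypothetical placeholders chosen only to show the selection logic. The sketch assumes each candidate precision configuration has been measured for runtime and for relative error against an all-double baseline, keeps the configurations that no other candidate beats on both metrics, and then picks the fastest one whose error meets the tolerance.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str         # hypothetical label for a precision configuration
    time_ms: float    # runtime in milliseconds (placeholder value)
    rel_error: float  # relative error vs. an all-double run (placeholder value)


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep configurations that are not dominated in (time, error).

    Lower is better for both metrics. After sorting by time, a
    configuration survives only if its error is strictly smaller than
    the error of every faster configuration seen so far.
    """
    front = []
    best_err = float("inf")
    for c in sorted(configs, key=lambda c: (c.time_ms, c.rel_error)):
        if c.rel_error < best_err:
            front.append(c)
            best_err = c.rel_error
    return front


def select(configs: list[Config], tol: float) -> Optional[Config]:
    """Return the fastest Pareto-optimal configuration with error <= tol."""
    feasible = [c for c in pareto_front(configs) if c.rel_error <= tol]
    return min(feasible, key=lambda c: c.time_ms) if feasible else None


# Made-up measurements, for illustration only.
CANDIDATES = [
    Config("all-double", 10.0, 1e-15),
    Config("mixed-A", 7.0, 1e-7),
    Config("mixed-B", 8.0, 1e-6),  # slower and less accurate than mixed-A
    Config("all-single", 5.5, 1e-6),
]

if __name__ == "__main__":
    for tol in (1e-6, 1e-7, 1e-12):
        chosen = select(CANDIDATES, tol)
        print(f"tol={tol:g}: {chosen.name if chosen else 'none feasible'}")
```

With these placeholder numbers, `mixed-B` never appears on the front because `mixed-A` is both faster and more accurate. Tightening the tolerance moves the selection along the front toward higher precision, which is the trade-off a tolerance-driven framework exposes.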
