FRSZ2 for In-Register Block Compression Inside GMRES on GPUs

Thomas Grützmacher, Robert Underwood, Sheng Di, Franck Cappello, Hartwig Anzt

TL;DR

This work tackles the memory-bandwidth bottleneck of GMRES on GPUs by introducing FRSZ2, an in-register, block-based compressor designed to accelerate CB-GMRES without sacrificing final accuracy. FRSZ2 groups Krylov basis values into blocks, decorrelates the exponent information, and encodes signs and significands into compact l-bit values, which allows decompression at memory-bandwidth speed on GPUs such as the NVIDIA H100. The authors report that FRSZ2, particularly the frsz2_32 variant, yields the best convergence among the tested schemes, delivers end-to-end speedups of up to 1.3x over uncompressed CB-GMRES, and runs 1.2–3.1x faster than cuSZp2 while approaching 99.6% of peak memory bandwidth. The results suggest a practical benefit for memory-bound Krylov solvers on modern GPUs; future work aims at generalization and predictive deployment to maximize these benefits.
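The block scheme summarized above can be illustrated with a small CPU-side sketch. It is not the authors' GPU kernel and does not reproduce FRSZ2's actual bit layout, which this summary does not specify. It is a generic block floating-point encoder that assumes one shared exponent per block and stores each value as a sign bit plus an (l-1)-bit truncated magnitude. The names `compress_block`, `decompress_block`, `compress`, and `decompress` are hypothetical.

```python
import math


def compress_block(values, l):
    """Encode one block as (shared exponent, list of l-bit integer codes)."""
    # Assumption: the shared exponent is the largest binary exponent
    # among the nonzero values of the block (0 for an all-zero block).
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    e_max = max(exps) if exps else 0
    codes = []
    for v in values:
        sign = 1 if math.copysign(1.0, v) < 0 else 0
        # |v| < 2**e_max, so the scaled magnitude fits into l - 1 bits.
        mag = int(math.ldexp(abs(v), l - 1 - e_max))  # truncation
        codes.append((sign << (l - 1)) | mag)
    return e_max, codes


def decompress_block(e_max, codes, l):
    """Invert compress_block up to the truncation error."""
    mask = (1 << (l - 1)) - 1
    out = []
    for c in codes:
        mag = c & mask
        sign = -1.0 if (c >> (l - 1)) else 1.0
        out.append(sign * math.ldexp(float(mag), e_max - (l - 1)))
    return out


def compress(values, block_size, l):
    """Split a vector into blocks and compress each block independently."""
    return [compress_block(values[i:i + block_size], l)
            for i in range(0, len(values), block_size)]


def decompress(blocks, l):
    out = []
    for e_max, codes in blocks:
        out.extend(decompress_block(e_max, codes, l))
    return out


if __name__ == "__main__":
    vec = [0.1, -0.3, 1e-5, 2.5]
    blocks = compress(vec, block_size=2, l=16)
    print(decompress(blocks, l=16))
```

In this sketch the absolute error of every value in a block is below 2^(e_max - (l - 1)), so values much smaller than the block maximum lose relative precision. That trade-off is inherent to sharing one exponent per block.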

Abstract

The performance of the GMRES iterative solver on GPUs is limited by the GPU main memory bandwidth. Compressed Basis GMRES outperforms GMRES by storing the Krylov basis in low precision, thereby reducing the volume of memory accesses. An open question is whether compression techniques that are more sophisticated than casting to low precision can enable large runtime savings while preserving the accuracy of the final results. This paper presents the lightweight in-register compressor FRSZ2, which can decompress at the bandwidth speed of a modern NVIDIA H100 GPU. In an experimental evaluation, we demonstrate that using FRSZ2 instead of low precision to compress the Krylov basis can bring larger runtime benefits without impacting final accuracy.

Paper Structure (19 sections, 4 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Algorithmic formulation of the restarted GMRES algorithm for solving sparse linear systems. Sections where compression can be used are highlighted.
  • Figure 2: Histogram of exponents and values from the atmosmodd matrix. Only the exponents cluster around a few common values; the values themselves are normally distributed, which makes decorrelation difficult.
  • Figure 3: FRSZ2 compression steps (BS $= 2$ and arbitrary $l>2$).
  • Figure 4: Performance on the H100.
  • Figure 5: Residual norm development for the atmosmodd matrix with various compressions.
  • ...and 6 more figures