A Nested Krylov Method Using Half-Precision Arithmetic

Kengo Suzuki, Takeshi Iwashita

TL;DR

The paper introduces F3R, a three-level nested Krylov solver that integrates flexible GMRES and Richardson within a multi-precision framework to exploit half-precision arithmetic. Precision is reduced progressively from fp64 to fp16 toward the innermost solver, and each inner solver runs for only a few iterations per invocation. With this design, F3R achieves speedups of up to 1.65x over a double-precision implementation and 2.42x over a double-single mixed-precision implementation while maintaining convergence, and it outperforms or matches restarted FGMRES, CG, and BiCGStab on CPU and GPU benchmarks. A key contribution is an adaptive weight-update scheme for the innermost Richardson step, which enhances stability across diverse problems. The results demonstrate practical gains for memory-bandwidth-limited sparse linear solves and suggest pathways for asynchronous and distributed extensions.

Abstract

Low-precision computing is essential for efficiently utilizing memory bandwidth and computing cores. While many mixed-precision algorithms have been developed for iterative sparse linear solvers, effectively leveraging half-precision (fp16) arithmetic remains challenging. This study introduces a novel nested Krylov approach that integrates the flexible GMRES and Richardson methods in a deeply nested structure, progressively reducing precision from double-precision to fp16 toward the innermost solver. To avoid meaningless computations beyond precision limits, the low-precision inner solvers perform only a few iterations per invocation, while the nested structure ensures their frequent execution. Numerical experiments show that using fp16 in the approach directly enhances solver performance without compromising convergence, achieving speedups of up to 1.65x and 2.42x over double-precision and double-single mixed-precision implementations, respectively. Moreover, the proposed method outperforms or matches other standard Krylov solvers, including restarted GMRES, CG, and BiCGStab methods.
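The nesting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' F3R implementation: it assumes only one fp64 FGMRES level, one fp32 FGMRES level, and an fp16 Richardson level, a fixed Richardson weight (`omega`) instead of the paper's adaptive weight update, Jacobi (diagonal) scaling inside Richardson, and a dense NumPy matrix. All function names and parameter values here are hypothetical.

```python
import numpy as np

def richardson(A, d_inv, r, sweeps, omega):
    """A few weighted Richardson sweeps for A z = r, entirely in r's dtype.
    Jacobi scaling (d_inv) and the fixed weight omega are assumptions."""
    w = r.dtype.type(omega)
    z = np.zeros_like(r)
    for _ in range(sweeps):
        z = z + w * d_inv * (r - A @ z)
    return z

def fgmres_cycle(A, b, m, precond):
    """One flexible GMRES(m) cycle for A x = b from x0 = 0, in b's dtype."""
    dtype, n = b.dtype, b.size
    beta = np.linalg.norm(b)
    V = np.zeros((m + 1, n), dtype=dtype)  # Arnoldi basis
    Z = np.zeros((m, n), dtype=dtype)      # preconditioned directions
    H = np.zeros((m + 1, m), dtype=dtype)  # Hessenberg matrix
    V[0] = b / beta
    k = 0
    for j in range(m):
        Z[j] = precond(V[j])               # inner solver acts as preconditioner
        w = A @ Z[j]
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = V[i] @ w
            w = w - H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        k = j + 1
        if H[j + 1, j] == 0:               # exact solution reached
            break
        V[j + 1] = w / H[j + 1, j]
    # The tiny least-squares problem is solved in fp64 for simplicity.
    e1 = np.zeros(k + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:k + 1, :k].astype(np.float64), e1, rcond=None)
    return (Z[:k].T @ y).astype(dtype)

def nested_solve(A, b, tol=1e-10, max_cycles=50):
    """fp64 FGMRES(8) <- fp32 FGMRES(4) <- fp16 Richardson(2 sweeps)."""
    A64, A32, A16 = (A.astype(t) for t in (np.float64, np.float32, np.float16))
    d16 = (1.0 / np.diag(A64)).astype(np.float16)

    def inner(v):   # fp16 level: input is a unit-norm basis vector
        return richardson(A16, d16, v.astype(np.float16), sweeps=2, omega=1.0)

    def middle(v):  # fp32 level
        return fgmres_cycle(A32, v.astype(np.float32), m=4, precond=inner)

    x = np.zeros(b.size, dtype=np.float64)
    for cycle in range(max_cycles):
        r = b - A64 @ x                    # true residual in fp64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, cycle
        x += fgmres_cycle(A64, r, m=8, precond=middle)
    return x, max_cycles

def model_matrix(n):
    """Diagonally dominant tridiagonal test matrix (hypothetical example)."""
    return 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
```

Calling `nested_solve(model_matrix(64), np.ones(64))` returns the solution together with the number of outer cycles used. Because each FGMRES level hands a unit-norm basis vector to the level below, the fp16 solver always works on well-scaled inputs, which is one reason short low-precision inner solves remain meaningful even when the outer residual is small.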

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Performance relative to fp64-F3R on the CPU node. A missing bar indicates that convergence failed. Each table above the plots shows the parameters of fp16-F3R-best, written as $m_2$-$m_3$-$m_4$, in the top row and the execution time in seconds of the baseline, fp64-F3R, in the bottom row.
  • Figure 2: Performance relative to fp64-F3R on the GPU node. A missing bar indicates that convergence failed. Each table above the plots shows the parameters of fp16-F3R-best, written as $m_2$-$m_3$-$m_4$, in the top row and the execution time in seconds of the baseline, fp64-F3R, in the bottom row.
  • Figure 3: Results for different values of $m_2$, $m_3$, and $m_4$. The boxplots on the right and top correspond to the y- and x-axis, respectively. The results are relative to fp16-F3R with the default setting ($(m_2, m_3, m_4) = (8, 4, 2)$); larger is better on both axes.
  • Figure 4: Relationship between performance and the depth of nesting. The results are relative to fp16-F3R with the default setting; larger is better.
  • Figure 5: Performance balance when changing the weight-updating cycle $c$ in the Richardson part. The results are relative to fp16-F3R with $c = 64$.
  • ...and 1 more figure