Mixed precision solvers with half-precision floating point numbers for Lattice QCD on A64FX processor
Issaku Kanamori, Hideo Matsufuru, Tatsumi Aoyama, Kazuyuki Kanaya, Yusuke Namekawa, Hidekatsu Nemura, Keigo Nitadori
TL;DR
This work investigates the use of FP16 arithmetic in mixed-precision solvers for lattice QCD on the A64FX processor. It identifies stability issues with naive FP16 approaches and introduces rescaling in both the outer iterative refinement and the inner BiCGStab solver, which enables robust convergence with an FP16 preconditioner. On a Wilson fermion kernel, the rescaled FP16 method converges with only modest extra iterations (within about 20% of the FP64 case) and delivers substantial speedups, reaching up to about 8249 GFlops in FP16, compared with 2045 GFlops in FP64 and 3895 GFlops in FP32. The results suggest that FP16, combined with the proposed rescaling techniques, is a viable path for accelerating lattice QCD solvers on ARM/SVE architectures, with potential extension to other preconditioners and more complex fermion matrices.
Abstract
We investigate the use of half-precision floating-point numbers (FP16) in mixed-precision linear solvers for lattice QCD simulations. Since the emergence of GPUs for general-purpose computing, mixed-precision algorithms that combine single-precision (FP32) with double-precision (FP64) arithmetic have become widely used in this field and others. While FP32-based methods are now well established, we examine the practicality of using FP16. In this work, we introduce rescaling in both the outer iterative refinement and the inner BiCGStab solver to avoid numerical instability. In our experiments with a simple Wilson kernel, the solver shows improved stability, and the additional iteration count compared to the FP64 version remains within 20%, indicating that the FP16 version is practical for use. We believe that the proposed rescaling methods can also benefit other mixed-precision preconditioners in avoiding underflows.
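The effect of rescaling in the outer iterative refinement can be illustrated with a minimal NumPy sketch. It is not the implementation from this work: the Wilson kernel is replaced by a small tridiagonal matrix, the inner BiCGStab solver by a few Jacobi sweeps carried out in FP16, and the scale factor (the largest absolute entry of the residual) is an assumed choice. The function names `make_system`, `jacobi_fp16`, and `refine` are hypothetical. Only the outer rescaling is shown; rescaling inside the inner solver is not covered.

```python
import numpy as np

def make_system(n=32):
    # Hypothetical stand-in for a fermion matrix: tridiagonal and
    # diagonally dominant, with entries that are exact in FP16.
    A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.sin(np.arange(1, n + 1, dtype=np.float64))
    return A, b

def jacobi_fp16(A16, r16, n_inner=12):
    # Approximate inner solve of A e = r, entirely in FP16.
    # Jacobi is used only for brevity, in place of BiCGStab.
    dinv = np.float16(1.0) / np.diag(A16)
    offdiag = A16 - np.diag(np.diag(A16))
    e16 = np.zeros_like(r16)
    for _ in range(n_inner):
        e16 = dinv * (r16 - offdiag @ e16)
    return e16

def refine(A, b, rescale=True, tol=1e-10, max_outer=50):
    # Outer iterative refinement in FP64 with an FP16 inner solve.
    A16 = A.astype(np.float16)
    x = np.zeros_like(b)
    bnorm = np.linalg.norm(b)
    for it in range(max_outer):
        r = b - A @ x                      # FP64 residual
        rel = np.linalg.norm(r) / bnorm
        if rel < tol:
            return x, rel, it
        # Rescale so the residual is O(1) before the cast to FP16;
        # without this, small residuals underflow in FP16.
        scale = np.max(np.abs(r)) if rescale else 1.0
        r16 = (r / scale).astype(np.float16)
        e16 = jacobi_fp16(A16, r16)
        x = x + scale * e16.astype(np.float64)  # undo the scaling in FP64
    rel = np.linalg.norm(b - A @ x) / bnorm
    return x, rel, max_outer

if __name__ == "__main__":
    A, b = make_system()
    for flag in (True, False):
        _, rel, it = refine(A, b, rescale=flag)
        print(f"rescale={flag}: relative residual {rel:.2e} after {it} outer iterations")
```

In this toy setting the unscaled variant stalls once the residual entries fall below the FP16 subnormal spacing of 2^-24 (about 6e-8), because the cast then rounds them to zero or to a coarse grid. Normalizing the residual first keeps the FP16 input at order one in every outer iteration, so the refinement can continue down to the FP64 tolerance.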
