Table of Contents
Fetching ...

Mixed precision solvers with half-precision floating point numbers for Lattice QCD on A64FX processor

Issaku Kanamori, Hideo Matsufuru, Tatsumi Aoyama, Kazuyuki Kanaya, Yusuke Namekawa, Hidekatsu Nemura, Keigo Nitadori

TL;DR

This work investigates using FP16 arithmetic in mixed-precision solvers for lattice QCD on the A64FX processor. It identifies stability issues with naive FP16 approaches and introduces rescaling in both the outer iterative refinement and the inner BiCGStab solver, enabling robust convergence when preconditioning with FP16. On a Wilson fermion kernel, the rescaled FP16 method achieves convergence with only modest extra iterations (within about 20% of the FP64 case) and delivers substantial speedups, attaining up to around 8249 GFlops for FP16 compared to 2045 GFlops (FP64) and 3895 GFlops (FP32). The results suggest FP16, when combined with the proposed rescaling techniques, is a viable path for accelerating lattice QCD solvers on ARM/SVE architectures, with potential extension to other preconditioners and more complex fermion matrices.

Abstract

We investigate the use of half-precision floating-point numbers (FP16) in mixed-precision linear solvers for lattice QCD simulations. Since the emergence of GPUs for general-purpose, mixed-precision algorithms that combine single-precision (FP32) with double-precision (FP64) arithmetics have become widely used in this field and others. While FP32-based methods are now well established, we examine the practicality of using FP16. In this work, we introduce rescaling steps in both the outer iterative refinement step and the inner BiCGStab solver to avoid numerical instability. In our experiments with a simple Wilson kernel, the solver shows improved stability, and the additional iteration count compared to the FP64 version remains within 20\%, indicating that the FP16 version is practical for use. We believe that the proposed rescaling methods can also benefit other mixed precision preconditioners in avoiding underflows.

Mixed precision solvers with half-precision floating point numbers for Lattice QCD on A64FX processor

TL;DR

This work investigates using FP16 arithmetic in mixed-precision solvers for lattice QCD on the A64FX processor. It identifies stability issues with naive FP16 approaches and introduces rescaling in both the outer iterative refinement and the inner BiCGStab solver, enabling robust convergence when preconditioning with FP16. On a Wilson fermion kernel, the rescaled FP16 method achieves convergence with only modest extra iterations (within about 20% of the FP64 case) and delivers substantial speedups, attaining up to around 8249 GFlops for FP16 compared to 2045 GFlops (FP64) and 3895 GFlops (FP32). The results suggest FP16, when combined with the proposed rescaling techniques, is a viable path for accelerating lattice QCD solvers on ARM/SVE architectures, with potential extension to other preconditioners and more complex fermion matrices.

Abstract

We investigate the use of half-precision floating-point numbers (FP16) in mixed-precision linear solvers for lattice QCD simulations. Since the emergence of GPUs for general-purpose, mixed-precision algorithms that combine single-precision (FP32) with double-precision (FP64) arithmetics have become widely used in this field and others. While FP32-based methods are now well established, we examine the practicality of using FP16. In this work, we introduce rescaling steps in both the outer iterative refinement step and the inner BiCGStab solver to avoid numerical instability. In our experiments with a simple Wilson kernel, the solver shows improved stability, and the additional iteration count compared to the FP64 version remains within 20\%, indicating that the FP16 version is practical for use. We believe that the proposed rescaling methods can also benefit other mixed precision preconditioners in avoiding underflows.
Paper Structure (7 sections, 6 equations, 9 figures, 2 tables, 5 algorithms)

This paper contains 7 sections, 6 equations, 9 figures, 2 tables, 5 algorithms.

Figures (9)

  • Figure 1: Two-dimensional sketch of the Wilson kernel. The quark fields $\psi(x)$ hops form $x$ to the neighboring lattice sites.
  • Figure 2: A code to read from an array of _Float16 to svfloat32_t. This code does not preserve the order of the elements, but is valid for the reduction purpose.
  • Figure 3: Pseudo code to calculate norm squared. Here, the site degrees of freedom are used for SIMD and thread parallelization, and site_in refers to inside the SIMD variable. Each lattice site has num_on_site variables.
  • Figure 4: Convergence of the FP64 BiCGStab solver: Residual norm squared versus number of matrix-vector multiplications.
  • Figure 5: Convergence of the FP32/FP64 mixed precision solvers. Residual norm squared versus number of matrix-vector multiplications, for the inner solver's convergence criteria set at $10^{-12}$ (top panel) and $10^{-4}$ (bottom). The purple and green curves indicate the residual norms of the inner low-precision and outer FP64 solvers, respectively.
  • ...and 4 more figures