Multiple right hand side multigrid for domain wall fermions with a multigrid preconditioned block conjugate gradient algorithm
Peter A Boyle
TL;DR
This work tackles the critical slowing down of domain wall/Möbius fermion solvers by introducing a multiple right-hand side (MRHS) multigrid approach built on the HDCG framework. It combines a preconditioned BlockCGrQ outer solver with a stationary Chebyshev multigrid preconditioner and two subspace setup strategies (Chebyshev-filtered vectors and Lanczos-derived eigenvectors) to accelerate inversions for several right-hand sides concurrently. The results show substantial performance gains (up to roughly 20x over red-black preconditioned conjugate gradient on a large physical-mass lattice) and a sub-dominant coarse-space cost, with robust GPU performance using batched GEMM and the Grid infrastructure. The findings demonstrate scalable, high-throughput valence propagator inversions at physical quark masses and suggest practical pathways toward faster gauge configuration generation and broader use of MRHS multigrid in lattice QCD computations.
Abstract
We introduce a class of efficient multiple right-hand side multigrid algorithms for domain wall fermions. Solving for a modest number of right-hand sides concurrently allows for a significant reduction in the time spent solving the coarse grid operator in a multigrid preconditioner. We introduce a preconditioned block conjugate gradient with a multigrid preconditioner, giving additional algorithmic benefit from the multiple right-hand sides. Multiple right-hand sides also bring a very significant additional benefit in computation rate: they both increase the arithmetic intensity in the coarse space and increase the amount of work performed in each subroutine call, leading to excellent performance on modern GPU architectures. Further, the software implementation makes use of vendor linear algebra routines (batched GEMM) that can exploit high-throughput tensor hardware on recent Nvidia, AMD and Intel GPUs. The cost of the coarse space is made sub-dominant in this algorithm, and benchmarks from the Frontier supercomputer system show up to a factor of twenty speed-up over the standard red-black preconditioned conjugate gradient algorithm on a large system with physical quark masses.
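As a rough illustration of the block conjugate gradient idea mentioned above, the following is a minimal sketch of an unpreconditioned BlockCGrQ-style iteration on a small dense, real symmetric positive definite matrix using NumPy. It is not the paper's implementation: the multigrid preconditioner, the domain wall operator, and the Grid/batched-GEMM machinery are all omitted, and the function name `block_cg_rq` and its parameters are hypothetical names chosen for this example. What it does show is that all right-hand sides share one operator application per iteration, which becomes a matrix-matrix product.

```python
import numpy as np

def block_cg_rq(A, B, tol=1e-10, max_iter=200):
    """Toy unpreconditioned BlockCGrQ-style solver for A X = B.

    Assumes A is a real symmetric positive definite (n x n) array and
    B is an (n x k) array whose k columns are linearly independent.
    Returns the solution block X and the number of iterations used.
    """
    X = np.zeros_like(B, dtype=float)
    # With X0 = 0 the initial residual is B; factor it as R = Q C.
    Q, C = np.linalg.qr(B)
    D = Q.copy()                      # block of search directions
    b_norms = np.linalg.norm(B, axis=0)
    for it in range(1, max_iter + 1):
        Z = A @ D                     # one operator call serves all k columns
        M = np.linalg.inv(D.T @ Z)    # small k x k system
        X = X + D @ (M @ C)
        # Re-orthonormalise the updated residual block: Q_new S = Q - Z M.
        Q, S = np.linalg.qr(Q - Z @ M)
        D = Q + D @ S.T
        C = S @ C                     # residual block is now R = Q C
        # Q has orthonormal columns, so column norms of C are residual norms.
        rel = np.linalg.norm(C, axis=0) / b_norms
        if rel.max() < tol:
            return X, it
    return X, max_iter

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 60, 4
    G = rng.standard_normal((n, n))
    A = G @ G.T + n * np.eye(n)       # well-conditioned SPD test matrix
    B = rng.standard_normal((n, k))
    X, iters = block_cg_rq(A, B)
    print("iterations:", iters)
    print("max residual:", np.abs(A @ X - B).max())
```

In this sketch the only large operations per iteration are `A @ D` and a QR factorisation of an n x k block; everything else involves k x k matrices. Expressing the work as products of blocks rather than single vectors is the kind of structure that maps naturally onto GEMM-style routines.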
