Table of Contents
Fetching ...

Multiple right hand side multigrid for domain wall fermions with a multigrid preconditioned block conjugate gradient algorithm

Peter A Boyle

TL;DR

This work tackles the critical slowing down of domain wall/Mobius fermion solvers by introducing a multiple right-hand side multigrid approach built on the HDCG framework. It combines a preconditioned BlockCGrQ outer solver with a stationary Chebyshev multigrid preconditioner and two subspace setup strategies (Chebyshev-filtered and Lanczos-derived eigenvectors) to accelerate inversions for several right-hand sides concurrently. The results show substantial performance gains (often exceeding 20x per RHS on large physical-mass lattices) and a sub-dominant coarse-space cost, with robust GPU performance using batched GEMM and Grid infrastructure. The findings demonstrate scalable, high-throughput valence propagator inversions at physical quark masses and offer practical pathways toward faster gauge configurations and broader applicability of MRHS multigrid in lattice QCD computations.

Abstract

We introduce a class of efficient multiple right-hand side multigrid algorithm for domain wall fermions. The simultaneous solution for a modest number of right hand sides concurrently allows for a significant reduction in the time spent solving the coarse grid operator in a multigrid preconditioner. We introduce a preconditioned block conjuate gradient with a multigrid preconditioner, giving additional algorithmic benefit from the multiple right hand sides. There is also a very significant additional to computation rate benefit to multiple right hand sides. This both increases the arithmetic intensity in the coarse space and increases the amount of work being performed in each subroutine call, leading to excellent performance on modern GPU architectures. Further, the software implementation makes use of vendor linear algebra routines (batched GEMM) that can make use of high throughput tensor hardware on recent Nvidia, AMD and Intel GPUs. The cost of the coarse space is made sub-dominant in this algorithm, and benchmarks from the Frontier supercomputer system show up to a factor of twenty speed up over the standard red-black preconditioned conjugate gradient algorithm on a large system with physical quark masses.

Multiple right hand side multigrid for domain wall fermions with a multigrid preconditioned block conjugate gradient algorithm

TL;DR

This work tackles the critical slowing down of domain wall/Mobius fermion solvers by introducing a multiple right-hand side multigrid approach built on the HDCG framework. It combines a preconditioned BlockCGrQ outer solver with a stationary Chebyshev multigrid preconditioner and two subspace setup strategies (Chebyshev-filtered and Lanczos-derived eigenvectors) to accelerate inversions for several right-hand sides concurrently. The results show substantial performance gains (often exceeding 20x per RHS on large physical-mass lattices) and a sub-dominant coarse-space cost, with robust GPU performance using batched GEMM and Grid infrastructure. The findings demonstrate scalable, high-throughput valence propagator inversions at physical quark masses and offer practical pathways toward faster gauge configurations and broader applicability of MRHS multigrid in lattice QCD computations.

Abstract

We introduce a class of efficient multiple right-hand side multigrid algorithm for domain wall fermions. The simultaneous solution for a modest number of right hand sides concurrently allows for a significant reduction in the time spent solving the coarse grid operator in a multigrid preconditioner. We introduce a preconditioned block conjuate gradient with a multigrid preconditioner, giving additional algorithmic benefit from the multiple right hand sides. There is also a very significant additional to computation rate benefit to multiple right hand sides. This both increases the arithmetic intensity in the coarse space and increases the amount of work being performed in each subroutine call, leading to excellent performance on modern GPU architectures. Further, the software implementation makes use of vendor linear algebra routines (batched GEMM) that can make use of high throughput tensor hardware on recent Nvidia, AMD and Intel GPUs. The cost of the coarse space is made sub-dominant in this algorithm, and benchmarks from the Frontier supercomputer system show up to a factor of twenty speed up over the standard red-black preconditioned conjugate gradient algorithm on a large system with physical quark masses.
Paper Structure (23 sections, 24 equations, 13 figures, 6 tables)

This paper contains 23 sections, 24 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Flexible ADEF2 preconditioned conjugate gradient algorithm following Tang et alTang for solving a Hermitian system ${\cal H} x= b$, with an additonal flexible search direction orthogonalisation (step 8). We have implemented "inexact preconditioned CG"GolubYe and "flexible CG" Notay variants of the ADEF2 algorithm to address the variability in the preconditioner when it is composed of Krylov processes. The matrix $Q$ is a coarse grid correction, and the matrix $M_{IRS}({\cal H},\Lambda)$ is a smoother in multigrid nomenclature. The Galerkin projector $P_R$ is defined in the text.
  • Figure 2: We generalise BlockCGrQ Dubrulle2001RetoolingTMbirk2015deflated (left) to derive Preconditioned BlockCGrQ (right).
  • Figure 3: Convergence of the red-black preconditioned conjugate gradient algorithm with physical quark mass on the Mobius domain wall fermion action with the Schur operator ${\cal H}$ and on our $1.7$ GeV $48^3\times 96 \times 24$ (5.5fm) and $96^3\times 192 \times 24$ (11fm) test configurations. The volume of eigenvector data and memory required to deflated the latter configuration is prohibitive as the cost grows as the square of the volume, whereas the multigrid solver setup, cost and footprint is proportional to the volume. The convergence is dictated by the spectrum and since the lattice spacing, quark mass and fermion action are identical between the two volumes (which differ by a factor of 16) the spectral density and convergence history remain almost identical. The system is solved with a polynomial order that is far less than the rank of both systems, which have a dense spectrum, so that each crossing of residual polynomial is covering a large number of eigenvalues in both cases, meaning the worst case bound is a reasonable description.
  • Figure 4: On our $48^3\times 96$ test volume we can assess the completeness of the low mode by computing $E_i = \sqrt{|| (1 - PP^\dagger) | i \rangle ||}$ and $E_i^{smoothed} = \sqrt{|| M_{IRS}(1 - PP^\dagger) | i \rangle ||}$ for each eigenpair, here with the Filter based subspace setup (left) and Lanczos based setup (right). Post-smoothing reduces the error and may be indicative of a relative cheap way to improve coarse operator based low mode variance reduction. The exact eigenvectors included in the coarse basis (right) are faithfully reproduced to numerical precision, while higher modes have a percent scale error.
  • Figure 5: Spectrum of the lowest 200 eigenmodes of the fine operator ${\cal H}$ and the coarse operator with both Lanczos (green) and Chebyshev filter (blue) setup schemes with $62$ near null basis vectors on the $48^3$ lattice. A cluster of exactly $n_{basis}=62$ very low eigenvalues are seen in the coarse operator, corresponding to the maximal diagonalisation of the fine operator within this set of near null vectors, whereas directions that involve a non-trivial coarse coordinate dependence in the coarse eigenvector necessarily incur spectral leakage at the boundaries between blocks and this lifts the coarse eigenvalue by an order of magnitude. The upper eigenvalue of the fine operator is around 89.0 while the upper eigenvalue of the coarse operator is around 37.0. On the right panel we zoom in on the lowest eigenvalues. We see that with the eigenvector setup the lowest eigenvalues are exactly reproduced by the coarse operator. This likely contributes a to a slightly improved convergence rate in section \ref{['sec:solverresults']}.
  • ...and 8 more figures