FlashMP: Fast Discrete Transform-Based Solver for Preconditioning Maxwell's Equations on GPUs

Haoyuan Zhang, Yaqian Gao, Xinxin Zhang, Jialin Li, Runfeng Jin, Yidong Chen, Feng Zhang, Wu Yuan, Wenpeng Ma, Shan Liang, Jian Zhang, Zhonghua Lu

TL;DR

This work tackles the challenge of solving large-scale CN-FDTD Maxwell systems dominated by an ill-conditioned double-curl operator. It introduces FlashMP, a transform-based subdomain exact solver that uses a discrete transform derived from the SVD of the forward-difference operator to diagonalize and decouple the system, combined with domain decomposition and a low-rank boundary correction via the Woodbury formula. For a subdomain with $n$ grid points per dimension, the approach has asymptotic computational and memory complexity of $O(n^4)$, compared with $O(n^6)$ for direct methods. It delivers up to 16x fewer iterations and 2.5x–4.9x speedups on multi-GPU AMD clusters, with parallel efficiency of up to $84.1\%$ in weak-scaling tests on up to 1000 GPUs. These results demonstrate the method’s practical impact for scalable electromagnetic simulations on large GPU infrastructures.
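The low-rank boundary correction mentioned above rests on the Woodbury identity, $(\mathbf{B} + \mathbf{Q}\mathbf{W}\mathbf{Q}^T)^{-1} = \mathbf{B}^{-1} - \mathbf{B}^{-1}\mathbf{Q}(\mathbf{W}^{-1} + \mathbf{Q}^T\mathbf{B}^{-1}\mathbf{Q})^{-1}\mathbf{Q}^T\mathbf{B}^{-1}$. The NumPy sketch below illustrates the identity on a toy problem and is not the paper's implementation. The name `woodbury_solve`, the diagonal stand-in for the cheaply invertible operator $\mathbf{B}$, the boundary indices, and the weights are all assumptions made for illustration.

```python
import numpy as np

def woodbury_solve(solve_B, Q, W, b):
    """Solve (B + Q W Q^T) x = b using only solves with B and a small k x k system.

    solve_B: callable applying B^{-1} to a vector or to the columns of a matrix
             (hypothetical stand-in for a fast transform-based solve).
    """
    y = solve_B(b)                       # B^{-1} b
    Z = solve_B(Q)                       # B^{-1} Q, one column per boundary unknown
    S = np.linalg.inv(W) + Q.T @ Z       # small k x k "capacitance" matrix
    return y - Z @ np.linalg.solve(S, Q.T @ y)

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
N = 50
d = rng.uniform(1.0, 2.0, N)             # diagonal of B, assumed easy to invert
idx = [0, 1, N - 2, N - 1]               # assumed "boundary" indices
Q = np.eye(N)[:, idx]                    # tall-skinny selection matrix
W = np.diag([2.0, 1.0, 1.0, 2.0])        # small correction weights

def solve_B(r):
    # Divide each row by the matching diagonal entry; works for 1-D and 2-D input.
    return (r.T / d).T

A = np.diag(d) + Q @ W @ Q.T             # full operator, formed only to check the result
b = rng.standard_normal(N)
x = woodbury_solve(solve_B, Q, W, b)
print(np.allclose(A @ x, b))
```

The point of the construction is that the only dense solve involves the small matrix `S`, whose size equals the number of boundary unknowns rather than the number of unknowns in the subdomain.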

Abstract

Efficiently solving large-scale linear systems is a critical challenge in electromagnetic simulations, particularly when using the Crank-Nicolson Finite-Difference Time-Domain (CN-FDTD) method. Existing iterative solvers are commonly employed to handle the resulting sparse systems but suffer from slow convergence due to the ill-conditioned nature of the double-curl operator. Approximate preconditioners, like Successive Over-Relaxation (SOR) and Incomplete LU decomposition (ILU), provide insufficient convergence, while direct solvers are impractical due to excessive memory requirements. To address this, we propose FlashMP, a novel preconditioning system that designs a subdomain exact solver based on discrete transforms. FlashMP provides an efficient GPU implementation that achieves multi-GPU scalability through domain decomposition. Evaluations on AMD MI60 GPU clusters (up to 1000 GPUs) show that FlashMP reduces iteration counts by up to 16x and achieves speedups of 2.5x to 4.9x compared to baseline implementations in state-of-the-art libraries such as Hypre. Weak scalability tests show parallel efficiencies up to 84.1%.

Paper Structure

This paper contains 20 sections, 32 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Steps of subdomain exact solving with discrete transform and low-rank correction.
  • Figure 2: Illustration of Low-Rank Boundary Correction. (a) Left: Why low-rank? Only surfaces matter, not volume. Total non-zeros: $2n^2 - n$ for the $x$ component. (a) Right: $\mathbf{\Lambda}$'s sparse diagonal with non-zeros (e.g., 2 at edges, 1 at faces) at boundary indices. (b) Low-rank decomposition $\mathbf{Q} \mathbf{W} \mathbf{Q}^T$, from a tall-skinny $\mathbf{Q}$ to a small $\mathbf{W}$, yielding a rank $\leq 6n^2 - 3n \ll 3n^3$, followed by the application of the Woodbury formula.
  • Figure 3: Tensor product operations of a field component $R_x$ along the $x$, $y$, and $z$ directions based on DGEMM.
  • Figure 4: Inter-subdomain communication. (a) Data derived from 26 adjacent subdomains are distinguished by different colors. The block in the middle represents the data that the subdomain originally had. (b) asm_comm represents the communication process involved in domain decomposition, including the three steps: Pack, Send & Recv, and Unpack.
  • Figure 5: Convergence curves of BiCGSTAB (a) and GMRES (b) with different preconditioners, where "NOPRE" represents without preconditioner, "OL_i" represents the use of the FlashMP with overlap $i$, and "ILU", "IC", "SOR" represent incomplete LU, incomplete Cholesky, and successive over-relaxation, respectively. The Y axis is the relative residual, and the X axis is the iteration number.
  • ...and 2 more figures
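The tensor-product operations described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, assuming a cubic $n \times n \times n$ field component and one $n \times n$ transform matrix per direction. The helper names `apply_along_axis` and `apply_3d`, the random matrices, and the mapping of array axes 0, 1, 2 to $x$, $y$, $z$ are assumptions, not the paper's kernels. Each directional transform is carried out as a single matrix-matrix product, which is the role DGEMM plays in the caption.

```python
import numpy as np

def apply_along_axis(T, R, axis):
    """Multiply every fiber of R along `axis` by T with one matrix-matrix product."""
    Rm = np.moveaxis(R, axis, 0)                 # bring the target axis to the front
    rest = Rm.shape[1:]
    out = T @ Rm.reshape(Rm.shape[0], -1)        # single GEMM: (m x n) @ (n x n^2)
    return np.moveaxis(out.reshape((T.shape[0],) + rest), 0, axis)

def apply_3d(Tx, Ty, Tz, R):
    """Apply Tx, Ty, Tz along axes 0, 1, 2 (assumed to be x, y, z)."""
    R = apply_along_axis(Tx, R, 0)
    R = apply_along_axis(Ty, R, 1)
    return apply_along_axis(Tz, R, 2)

# Small check against the explicit Kronecker-product operator.
rng = np.random.default_rng(1)
n = 4
Tx, Ty, Tz = (rng.standard_normal((n, n)) for _ in range(3))
R = rng.standard_normal((n, n, n))

out = apply_3d(Tx, Ty, Tz, R)
K = np.kron(np.kron(Tx, Ty), Tz)                 # n^3 x n^3 dense operator
print(np.allclose(out.ravel(), K @ R.ravel()))
```

Each of the three products costs $O(n^4)$ operations, whereas multiplying by the explicit $n^3 \times n^3$ Kronecker matrix costs $O(n^6)$. That matches the complexity figures quoted in the TL;DR, although the sketch does not reproduce the paper's data layout or GPU implementation.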