Table of Contents
Fetching ...

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

Swapnil Mache, Ivo M. Vellekoop

TL;DR

This work addresses the memory bottleneck of the Modified Born Series (MBS) for large-scale wave propagation by introducing a non-overlapping domain decomposition that distributes computations across multiple GPUs. The method preserves the MBS advantages—low memory usage, high accuracy, and monotonic convergence—while enabling larger problems through local subdomain convolutions and minimal inter-subdomain communication. The authors demonstrate substantial scalability, achieving a $3.27\cdot10^{7}$-wavelength 3D Helmholtz problem on two GPUs in 45 minutes, and show favorable performance up to four GPUs with manageable communication overhead. The approach is implemented in open-source Python, with strong potential extensions to Maxwell's equations and birefringent media, broadening applicability to optical and seismic wave simulations.

Abstract

The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In the current implementation of the MBS, the simulation size is limited by the working memory of a single computer or graphics processing unit (GPU). Here, we present a domain decomposition method that enhances the scalability of the MBS by distributing the computations over multiple GPUs, while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. With this new method, the computations can be performed in parallel, and a larger simulation size is possible as it is no longer limited to the memory size of a single computer or GPU. We show how to decompose large problems over subdomains and demonstrate our approach by solving the Helmholtz problem for a complex structure of $3.28\cdot 10^7$ cubic wavelengths ($320 \times 320 \times 320$ wavelengths) in just $45$ minutes with a dual-GPU simulation.

Domain decomposition of the modified Born series approach for large-scale wave propagation simulations

TL;DR

This work addresses the memory bottleneck of the Modified Born Series (MBS) for large-scale wave propagation by introducing a non-overlapping domain decomposition that distributes computations across multiple GPUs. The method preserves the MBS advantages—low memory usage, high accuracy, and monotonic convergence—while enabling larger problems through local subdomain convolutions and minimal inter-subdomain communication. The authors demonstrate substantial scalability, achieving a -wavelength 3D Helmholtz problem on two GPUs in 45 minutes, and show favorable performance up to four GPUs with manageable communication overhead. The approach is implemented in open-source Python, with strong potential extensions to Maxwell's equations and birefringent media, broadening applicability to optical and seismic wave simulations.

Abstract

The modified Born series (MBS) is a fast and accurate method for simulating wave propagation in complex structures. In the current implementation of the MBS, the simulation size is limited by the working memory of a single computer or graphics processing unit (GPU). Here, we present a domain decomposition method that enhances the scalability of the MBS by distributing the computations over multiple GPUs, while maintaining its accuracy, memory efficiency, and guaranteed monotonic convergence. With this new method, the computations can be performed in parallel, and a larger simulation size is possible as it is no longer limited to the memory size of a single computer or GPU. We show how to decompose large problems over subdomains and demonstrate our approach by solving the Helmholtz problem for a complex structure of cubic wavelengths ( wavelengths) in just minutes with a dual-GPU simulation.
Paper Structure (15 sections, 9 equations, 9 figures, 2 tables)

This paper contains 15 sections, 9 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Convergence of the MBS for a problem of size $50\times50\times50$ wavelengths sampled at $4$ points per wavelength (details in Sections \ref{['sec:results']} and \ref{['subsec:results_validation']}) for different refractive index values of the scattering medium (the background refractive index is kept $1$). The real part ranges from $1$ to $5$, and the imaginary part (extinction coefficient) from $0$ to $9$. The (a) number of iterations and (b) simulation time to converge to a relative residual of $10^{-6}$ increase with scattering contrast as expected.
  • Figure 2: Matrix representation of splitting operator $A$ into $L + V$ for a 1D system. (a) Operator $A$ is split into (b) $L$ containing the Laplacian operator, a band along the diagonal, and (c) $V$, a diagonal matrix containing the scattering potential. (d) The operator $L$ with wraparound artefacts resulting from implementing $L$ as a fast convolution (Eq. \ref{['eq:L-convolution']}).
  • Figure 3: Matrix representation of the block decomposition of Eq. \ref{['eq:LV']}. (a) $A$ (same as Fig. \ref{['fig:splitting']}a) is decomposed over two subdomains, shown by the dashed grey lines. (b) Decomposition of $L$, implemented as a fast convolution (Eq. \ref{['eq:L-convolution']}) over each subdomain, thus containing wraparound artefacts. The off-diagonal (communication) blocks are empty. (c) $V \coloneqq A - L$ has diagonal blocks containing the scattering potential on the diagonal elements and wrapping artefacts from $L$ but with the opposite sign, and off-diagonal blocks representing communication between the subdomains.
  • Figure 4: Generation of the off-diagonal blocks $A_{12}$ and $A_{21}$ using the angular spectrum kernel. $A_{21}=A_{12}^T$, so we only show $A_{12}$. (a) Field after convolution of the Laplacian with a point source. (b) Field when the fast convolution is computed only over Subdomain 1 (white background). The difference between the fields in (a) and (b) is shown by the dashed red lines and gives columns of (c) the (truncated) block $A_{12}$.
  • Figure 5: Using truncated blocks in $V$ to reconstruct $A$. (a) Matrix representation of $A$ decomposed over two subdomains (demarcated by the dashed grey lines), but with truncated blocks. (b) $L$ decomposed over the subdomains remains unchanged and is the same as in Fig.\ref{['fig:decompose']}b. (c) $V$ with truncated blocks for communication (off-diagonal) and removing wrapping artefacts (off-diagonal elements of the diagonal blocks), visible as square blocks of side $t=6$ in the corners of subdomain blocks. Note that even though the colour scale is logarithmic in both the positive and negative directions from 0, the blocks are hardly visible, showing that the effect of the truncation is small.
  • ...and 4 more figures