Table of Contents
Fetching ...

M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures

Srinath Kailasa, Timo Betcke, Sarah El Kazdadi

TL;DR

The paper tackles speeding M2L translations in kernel-independent Fast Multipole Methods on modern architectures by proposing a BLAS-based M2L with randomized low-rank compression, offering portability and simpler implementation compared to FFT-based M2L. It demonstrates that, for the Laplace kernel, the blas_m2l approach can match or exceed FFT-based performance in high-accuracy, static-particle scenarios due to higher arithmetic intensity, at the cost of longer setup times, which can be amortized. Through a Rust-based implementation and benchmarks on Apple M1 Pro and AMD 3790X, the work provides a nuanced, architecture-dependent trade-off analysis and shows that FFT-M2L remains advantageous at low accuracy or dynamic workloads, while BLAS-M2L shines for high-accuracy, reusable setups. The study highlights practical implications for deploying kiFMM on contemporary CPUs and motivates exploring GPU-oriented batched-BLAS implementations.

Abstract

Hardware trends favor algorithm designs that maximize data reuse per FLOP. We develop and benchmark high-performance Multipole-to-Local (M2L) translation operators for the kernel-independent Fast Multipole Method (kiFMM), a widely adopted FMM variant that supports a broad class of kernels and has been favored by recent implementations for its simple specification. Naively implemented, M2L is bandwidth-limited and therefore a key bottleneck in the FMM. State-of-the-art FFT-based M2L implementations, though elegant and with a fast setup time, suffer from low operational intensity and require architecture-specific optimizations. We demonstrate that a BLAS-based M2L, combined with randomized low-rank compression, achieves competitive performance with greater portability and a simpler implementation leveraging existing BLAS infrastructure, at the cost of higher setup times-especially for high-accuracy settings in double precision. Our Rust-based implementation enables seamless switching between strategies for fair benchmarking. Results on CPUs show that FFT-based M2L is favorable in low-accuracy settings or dynamic particle simulations, while BLAS-based M2L is favored for high-accuracy settings for static particle distributions, where its higher setup costs are amortized in many practical applications of the FMM.

M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures

TL;DR

The paper tackles speeding M2L translations in kernel-independent Fast Multipole Methods on modern architectures by proposing a BLAS-based M2L with randomized low-rank compression, offering portability and simpler implementation compared to FFT-based M2L. It demonstrates that, for the Laplace kernel, the blas_m2l approach can match or exceed FFT-based performance in high-accuracy, static-particle scenarios due to higher arithmetic intensity, at the cost of longer setup times, which can be amortized. Through a Rust-based implementation and benchmarks on Apple M1 Pro and AMD 3790X, the work provides a nuanced, architecture-dependent trade-off analysis and shows that FFT-M2L remains advantageous at low accuracy or dynamic workloads, while BLAS-M2L shines for high-accuracy, reusable setups. The study highlights practical implications for deploying kiFMM on contemporary CPUs and motivates exploring GPU-oriented batched-BLAS implementations.

Abstract

Hardware trends favor algorithm designs that maximize data reuse per FLOP. We develop and benchmark high-performance Multipole-to-Local (M2L) translation operators for the kernel-independent Fast Multipole Method (kiFMM), a widely adopted FMM variant that supports a broad class of kernels and has been favored by recent implementations for its simple specification. Naively implemented, M2L is bandwidth-limited and therefore a key bottleneck in the FMM. State-of-the-art FFT-based M2L implementations, though elegant and with a fast setup time, suffer from low operational intensity and require architecture-specific optimizations. We demonstrate that a BLAS-based M2L, combined with randomized low-rank compression, achieves competitive performance with greater portability and a simpler implementation leveraging existing BLAS infrastructure, at the cost of higher setup times-especially for high-accuracy settings in double precision. Our Rust-based implementation enables seamless switching between strategies for fair benchmarking. Results on CPUs show that FFT-based M2L is favorable in low-accuracy settings or dynamic particle simulations, while BLAS-based M2L is favored for high-accuracy settings for static particle distributions, where its higher setup costs are amortized in many practical applications of the FMM.
Paper Structure (18 sections, 50 equations, 6 figures, 16 tables)

This paper contains 18 sections, 50 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: We illustrate the surfaces, and their associated discretization points, required to construct multipole and local expansions from source points, shown in green. This is for a problem in $\mathbb{R}^3$, where we have taken $P=4$. We show cross sections of cubic surfaces, as in Figure 4 of Ying2004, where source points are shown as green points. The arrows illustrate the steps of the calculation: (1) compute the check potential from the source points, (2) calculate the equivalent densities using the check potentials.
  • Figure 2: We illustrate the surfaces, and their associated discretization points, required to perform each field translation, m2m, m2l and l2l. This is for a problem in $\mathbb{R}^3$, where we have taken $P=4$. We show cross sections of cubic surfaces, as in Figure 5 of Ying2004. We show how for the m2l, the source box's equivalent surface and the target box's check surface can be seen to originate from a regular Cartesian grid, enabling the acceleration in the computation of the check potential in this case via the fft. The arrows illustrate the flow of the calculation: (1) compute the check potential, (2) evaluate the equivalent density using the check potential.
  • Figure 3: The Laplace potential is computed in double precision such that the error, calculated using \ref{['eq:l2_error']}, is $\mathcal{O}\left(\varepsilon\right) = 10^{-10}$. We compare fft_m2l and blas_m2l on a highly non-uniform point distribution consisting of 1,441,572 vertices from 480,524 triangles, each with unit source density. The octree is uniformly refined to a depth of $d=5$, producing 2,184 non-empty leaf boxes. The same set of points is used as both sources and targets. fft_m2l is run with equivalent surface order $P_e = 10$ and a batch size of 16. blas_m2l is configured with $P_e = 10$, check surface order $P_c = 11$, an singular value threshold of $1 \times 10^{-11}$ below which associated singular vectors in the compressed m2l matrices are cutoff, and 20 oversamples in the rsvd. The mean runtime on this non-uniform dataset is 5,779 ms for fft_m2l and 5,176 ms for blas_m2l. For comparison, the same number of uniformly distributed source and target points (with identical FMM parameters and octree depth $d = 4$ yielding a similar level of accuracy - the lower level of refinement used to better handle the uniform distribution) yield runtimes of 3,933 ms (fft_m2l) and 4,749 ms (blas_m2l). Thus, the non-uniformity incurs a $\sim$ 47% runtime increase for fft_m2l and $\sim$ 8% for blas_m2l compared to the evaluation over a uniform point distribution which can be interpreted as a baseline. The histogram illustrates the uneven point distribution, with leaf boxes containing between 1 and $10^5$ points. Geometry from thingiverse_3253610.
  • Figure 4: Source and target clusters illustrated in two dimensions. Here a target cluster consisting of four sibling quadrants is shown in ping, and the eight source clusters, which consist of the target cluster's parent's neighbours are shown in blue.
  • Figure 5: We show a one dimensional M2L translation for $p=3$ expansions from the red source box $\{y_j\}_{j=0}^2$, embedded in a convolution grid $\{ \tilde{y}_j \}_{j=0}^{2P-1}$, with associated multipole expansion coefficients $\{q_j\}_{j=0}^2$ to a blue target box $\{x_i\}_{i=0}^2$. We associate the required sequences of kernel evaluations $K[.]$, and the flipped sequence $K'[.]$, and the sequence of source densities $\tilde{q}[.]$ associated with points on the convolution grid
  • ...and 1 more figures