M2L Translation Operators for Kernel Independent Fast Multipole Methods on Modern Architectures
Srinath Kailasa, Timo Betcke, Sarah El Kazdadi
TL;DR
The paper addresses the cost of M2L translations in the kernel-independent Fast Multipole Method (kiFMM) on modern architectures by proposing a BLAS-based M2L with randomized low-rank compression, which offers greater portability and a simpler implementation than FFT-based M2L. For the Laplace kernel, it demonstrates that the BLAS-based approach can match or exceed FFT-based performance in high-accuracy, static-particle scenarios thanks to its higher arithmetic intensity, at the cost of longer setup times that can be amortized. Using a Rust-based implementation and benchmarks on Apple M1 Pro and AMD 3790X, the work gives an architecture-dependent trade-off analysis: FFT-based M2L remains advantageous at low accuracy or for dynamic workloads, while BLAS-based M2L is preferable for high-accuracy settings where the setup can be reused. The study highlights practical implications for deploying the kiFMM on contemporary CPUs and motivates exploring GPU-oriented batched-BLAS implementations.
Abstract
Hardware trends favor algorithm designs that maximize data reuse per FLOP. We develop and benchmark high-performance Multipole-to-Local (M2L) translation operators for the kernel-independent Fast Multipole Method (kiFMM), a widely adopted FMM variant that supports a broad class of kernels and has been favored by recent implementations for its simple specification. Naively implemented, M2L is bandwidth-limited and therefore a key bottleneck in the FMM. State-of-the-art FFT-based M2L implementations, though elegant and fast to set up, suffer from low operational intensity and require architecture-specific optimizations. We demonstrate that a BLAS-based M2L, combined with randomized low-rank compression, achieves competitive performance with greater portability and a simpler implementation that leverages existing BLAS infrastructure, at the cost of higher setup times, especially for high-accuracy settings in double precision. Our Rust-based implementation enables seamless switching between strategies for fair benchmarking. Results on CPUs show that FFT-based M2L is favorable in low-accuracy settings or dynamic particle simulations, while BLAS-based M2L is favored in high-accuracy settings with static particle distributions, where its higher setup cost is amortized in many practical applications of the FMM.
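The data-reuse argument can be made concrete with a rough arithmetic-intensity estimate. The sketch below is illustrative only and is not taken from the paper: it assumes an idealized memory model in which the operator, the inputs and the outputs each cross the memory bus exactly once. The function name `gemm_intensity` and the sizes used are hypothetical. It compares applying a dense operator to one vector at a time (matrix-vector) with applying it to a batch of vectors in one matrix-matrix product, which is the kind of call a BLAS-based formulation can use.

```python
def gemm_intensity(n: int, m: int, bytes_per_value: int = 8) -> float:
    """FLOPs per byte for applying a dense n x n operator to m vectors at once.

    Idealized model (an assumption, not a measurement): the operator,
    the m input vectors and the m output vectors are each moved through
    memory exactly once. bytes_per_value = 8 corresponds to double precision.
    """
    flops = 2 * n * n * m  # one multiply and one add per operator entry per vector
    traffic = bytes_per_value * (n * n + 2 * n * m)  # operator + inputs + outputs
    return flops / traffic


if __name__ == "__main__":
    n = 100  # hypothetical operator dimension
    for m in (1, 10, 100, 1000):  # number of vectors processed per call
        print(f"batch of {m:4d} vectors: {gemm_intensity(n, m):6.2f} FLOPs/byte")
```

Under this model a single matrix-vector product stays below 0.25 FLOPs per byte in double precision regardless of `n`, while the batched product reuses the operator across all `m` vectors and its intensity grows with the batch size. Measured figures depend on cache behavior and on the BLAS library in use.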
