Flexible Multi-Dimensional FFTs for Plane Wave Density Functional Theory Codes
Doru Thom Popovici, Mauro del Ben, Osni Marques, Andrew Canning
TL;DR
The paper addresses the need for flexible, distributed multi-dimensional FFTs tailored to plane-wave density functional theory codes that operate on batched spherical data. It introduces FFTB, a modular framework with a processing-grid API that supports both cuboid and sphere-based data, enabling batched and non-batched transforms on CPU and GPU backends. The approach fuses local transforms with data movement through a programmable pipeline, achieving superior scalability on HPC systems and reducing redundant padding via staged padding strategies. Experimental results on GPU-accelerated systems demonstrate strong scaling and the practical benefits of batching for plane-wave FFTs, highlighting FFTB’s potential to accelerate plane-wave DFT workflows across diverse architectures. The work offers a path toward integrating flexible FFTs into existing DFT codes and extending support to future HPC platforms, with open-source release planned.
Abstract
Multi-dimensional Fourier transforms are key mathematical building blocks that appear in a wide range of applications, from materials science, physics, and chemistry to machine learning. In recent years, many software packages for distributed multi-dimensional Fourier transforms have been developed. Most offer efficient implementations of single transforms applied to data mapped onto rectangular grids. However, not all scientific applications conform to this pattern; for example, plane wave Density Functional Theory codes require multi-dimensional Fourier transforms applied to data represented as batches of spheres. Typically, implementations for this use case are hand-coded and tailored to the requirements of each application. In this work, we present the Fastest Fourier Transform from Berkeley (FFTB), a distributed framework that offers flexible implementations for both regular/non-regular data grids and batched/non-batched transforms. We provide flexible implementations with a user-friendly API that captures most of the use cases. Furthermore, we provide implementations for both CPU and GPU platforms, showing that our approach offers improved execution time and scalability on the HPE Cray EX supercomputer. In addition, we outline the need for flexible implementations for different use cases of the software package.
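To make the sphere-based, batched use case concrete, the following is a minimal single-process sketch in NumPy of a batched inverse 3D transform on sphere-shaped data. It is not FFTB's API: the names sphere_mask and staged_sphere_ifft are hypothetical, and the sketch only assumes that the coefficients are nonzero inside a sphere in frequency space. Under that assumption, the 1D transforms along z can skip every column that misses the sphere, and the transforms along y can skip every empty x-plane, so the full cube is touched only in the last stage.

```python
import numpy as np

def sphere_mask(n, radius):
    """Boolean (n, n, n) mask of grid points inside a sphere in frequency space."""
    # Signed integer frequencies in FFT order: 0, 1, ..., -2, -1.
    k = np.fft.fftfreq(n, d=1.0 / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    return kx**2 + ky**2 + kz**2 <= radius**2

def staged_sphere_ifft(coeffs, mask):
    """Batched inverse 3D FFT of data that is zero outside `mask`.

    coeffs: complex array of shape (batch, n, n, n), axes ordered (batch, x, y, z).
    mask:   boolean array of shape (n, n, n) marking the sphere.
    """
    out = np.zeros_like(coeffs)

    # Stage 1: transform along z, only for (x, y) columns that hit the sphere.
    cols = mask.any(axis=2)                      # shape (n, n)
    out[:, cols, :] = np.fft.ifft(coeffs[:, cols, :], axis=-1)

    # Stage 2: transform along y, only for x-planes that contain data.
    planes = cols.any(axis=1)                    # shape (n,)
    out[:, planes, :, :] = np.fft.ifft(out[:, planes, :, :], axis=2)

    # Stage 3: transform along x over the full cube.
    return np.fft.ifft(out, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, radius, batch = 16, 4, 3
    mask = sphere_mask(n, radius)

    # Fill a batch of spheres with random coefficients; zeros elsewhere.
    coeffs = np.zeros((batch, n, n, n), dtype=complex)
    m = int(mask.sum())
    coeffs[:, mask] = rng.standard_normal((batch, m)) + 1j * rng.standard_normal((batch, m))

    # Reference: pad to the full cube and run a dense 3D inverse FFT per batch entry.
    reference = np.fft.ifftn(coeffs, axes=(1, 2, 3))
    print(np.allclose(staged_sphere_ifft(coeffs, mask), reference))
    print("z-columns transformed:", int(mask.any(axis=2).sum()), "of", n * n)
```

The skipped columns and planes contain only zeros, whose transform is zero, so the staged result matches the dense transform while doing fewer 1D transforms in the first two stages. In a distributed setting the data exchange between stages would replace the in-memory indexing used here; that part is outside the scope of this sketch.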
