PWDFT-SW: Extending the Limit of Plane-Wave DFT Calculations to 16K Atoms on the New Sunway Supercomputer
Qingcai Jiang, Zhenwei Cao, Junshi Chen, Xinming Qin, Wei Hu, Hong An, Jinlong Yang
TL;DR
This work tackles the prohibitive memory and communication costs of plane-wave DFT on the new Sunway supercomputer by introducing PWDFT-SW, a suite of optimizations that refactor pseudopotential handling, optimize memory access, and reorganize inter-process communication. Key innovations include a memory-efficient pseudopotential distribution, in-place data transformations that reduce temporary buffers, a multistage Allreduce tailored to the Sunway network, and granularity-aware parallelization that selects the optimal process count for each step, complemented by SWUC-assisted heterogeneous programming. The result is a 64.8x speedup for a 4,096-atom silicon system and the ability to handle systems of 16,384 carbon atoms (e.g., graphene) using a fraction of the original resources, extending PW-based DFT to large-scale systems. These improvements make PWDFT practical and scalable on a platform limited to 16 GB of memory per process, and they offer strategies that generalize to other HPC architectures in computational materials science.
Abstract
First-principles density functional theory (DFT) with a plane-wave (PW) basis set is the most widely used method in quantum mechanical materials simulations because of its accuracy and universality. However, a perceived drawback of PW-based DFT calculations is their substantial computational cost and memory usage, which currently limits their ability to simulate large-scale complex systems containing thousands of atoms. This situation is exacerbated on the new Sunway supercomputer, where each process is limited to a mere 16 GB of memory. Herein, we present a novel parallel implementation of plane-wave density functional theory on the new Sunway supercomputer (PWDFT-SW). PWDFT-SW fully exploits the Sunway supercomputer by extensively refactoring and calibrating our algorithms to match the characteristics of the system. Through extensive numerical experiments, we demonstrate that our methods substantially decrease both computational cost and memory usage. Our optimizations translate to a speedup of 64.8x for a physical system containing 4,096 silicon atoms, enabling us to push the limit of PW-based DFT calculations to large-scale systems containing 16,384 carbon atoms.
