Table of Contents
Fetching ...

Matrix-free algorithms for fast ab initio calculations on distributed CPU architectures using finite-element discretization

Gourab Panigrahi, Phani Motamarri

TL;DR

The paper tackles the memory- and bandwidth-bound bottleneck of repeated FE-DFT operator applications in iterative KS-DFT eigensolvers by introducing matrix-free operator evaluations that exploit the tensor-product structure of high-order FE bases. It presents a unified multilevel batched data layout that handles real and complex arithmetic, enabling on-the-fly tensor contractions for the local and nonlocal DFT operator components and integrates seamlessly with mixed-precision eigensolvers such as R-ChFSI. Across pseudopotential and all-electron, non-periodic and periodic systems on Frontier, Param Pravega, and Fugaku, the matrix-free implementation achieves 1.2–5.8× speedups over the state-of-the-art cell-matrix baseline and demonstrates favorable roofline characteristics, highlighting memory-bound yet cache-efficient gains. The results show substantial reductions in end-to-end time-to-solution for large-scale KS-DFT problems and establish a scalable framework that can extend to heterogeneous architectures and GPU acceleration, promising broader applicability to tensor-product FE PDE solvers. The work underscores the importance of data layout, even-odd symmetry exploitation, and blended real/complex batching in achieving high performance on modern HPC systems while preserving chemical accuracy in energies and forces.

Abstract

Finite-element (FE) discretisations have emerged as a powerful real-space alternative to large-scale Kohn-Sham density functional theory (DFT) calculations, offering systematic convergence, excellent parallel scalability, while accommodating generic boundary conditions. However, the dominant computational bottleneck in FE-based DFT arises from the repeated application of the discretised sparse Hamiltonian to large blocks of trial vectors during iterations in an iterative eigensolver. Traditional sparse matrix-vector multiplications and FE cell-matrix approaches encounter memory limitations and high data-movement overheads, particularly at higher polynomial orders, typically used in DFT calculations. To overcome these challenges, this work develops matrix-free algorithms for FE-discretised DFT that substantially accelerate these products by doing on-the-fly operations that utilize structured tensor contractions over 1D basis functions and quadrature data. A unified multilevel batched data layout that handles both real and complex-valued operators is introduced to maximise cache reuse and SIMD utilisation on Frontier (AVX2), Param Pravega (AVX512) and Fugaku (SVE). We also combine terms for optimal cache reuse, even-odd decomposition to reduce FLOP, and mixed-precision intrinsics. Extensive benchmarks show that for large multivector pseudopotential DFT calculations, the matrix-free kernels deliver 1.5-4x speedups over the state-of-the-art cell-matrix approach baselines. For all-electron DFT calculations, the matrix-free operator achieves gains of up to 5.8x due to its efficient implementation and superior arithmetic intensity. When integrated with an error-tolerant Chebyshev-filtered subspace iteration eigensolver, the matrix-free formalism yields substantial reductions in end-to-end time-to-solution using FE meshes that deliver desired accuracies in ground-state properties.

Matrix-free algorithms for fast ab initio calculations on distributed CPU architectures using finite-element discretization

TL;DR

The paper tackles the memory- and bandwidth-bound bottleneck of repeated FE-DFT operator applications in iterative KS-DFT eigensolvers by introducing matrix-free operator evaluations that exploit the tensor-product structure of high-order FE bases. It presents a unified multilevel batched data layout that handles real and complex arithmetic, enabling on-the-fly tensor contractions for the local and nonlocal DFT operator components and integrates seamlessly with mixed-precision eigensolvers such as R-ChFSI. Across pseudopotential and all-electron, non-periodic and periodic systems on Frontier, Param Pravega, and Fugaku, the matrix-free implementation achieves 1.2–5.8× speedups over the state-of-the-art cell-matrix baseline and demonstrates favorable roofline characteristics, highlighting memory-bound yet cache-efficient gains. The results show substantial reductions in end-to-end time-to-solution for large-scale KS-DFT problems and establish a scalable framework that can extend to heterogeneous architectures and GPU acceleration, promising broader applicability to tensor-product FE PDE solvers. The work underscores the importance of data layout, even-odd symmetry exploitation, and blended real/complex batching in achieving high performance on modern HPC systems while preserving chemical accuracy in energies and forces.

Abstract

Finite-element (FE) discretisations have emerged as a powerful real-space alternative to large-scale Kohn-Sham density functional theory (DFT) calculations, offering systematic convergence, excellent parallel scalability, while accommodating generic boundary conditions. However, the dominant computational bottleneck in FE-based DFT arises from the repeated application of the discretised sparse Hamiltonian to large blocks of trial vectors during iterations in an iterative eigensolver. Traditional sparse matrix-vector multiplications and FE cell-matrix approaches encounter memory limitations and high data-movement overheads, particularly at higher polynomial orders, typically used in DFT calculations. To overcome these challenges, this work develops matrix-free algorithms for FE-discretised DFT that substantially accelerate these products by doing on-the-fly operations that utilize structured tensor contractions over 1D basis functions and quadrature data. A unified multilevel batched data layout that handles both real and complex-valued operators is introduced to maximise cache reuse and SIMD utilisation on Frontier (AVX2), Param Pravega (AVX512) and Fugaku (SVE). We also combine terms for optimal cache reuse, even-odd decomposition to reduce FLOP, and mixed-precision intrinsics. Extensive benchmarks show that for large multivector pseudopotential DFT calculations, the matrix-free kernels deliver 1.5-4x speedups over the state-of-the-art cell-matrix approach baselines. For all-electron DFT calculations, the matrix-free operator achieves gains of up to 5.8x due to its efficient implementation and superior arithmetic intensity. When integrated with an error-tolerant Chebyshev-filtered subspace iteration eigensolver, the matrix-free formalism yields substantial reductions in end-to-end time-to-solution using FE meshes that deliver desired accuracies in ground-state properties.

Paper Structure

This paper contains 32 sections, 33 equations, 10 figures, 2 tables, 4 algorithms.

Figures (10)

  • Figure 1: Pictorial depiction of the multilevel batched layout described in \ref{['sec:multiBatch']}
  • Figure 2: Speedups of our matrix-free implementation strategies for Al nanoparticle (147 and 561 atoms with ${\sim}10k$ DoFs/atom and ${\sim}7k$ DoFs/atom respectively) non-periodic system with pseudopotential DFT operator on Frontier (Left: 2 nodes, Right: 4 nodes), Param Pravega (Left: 2 nodes, Right: 4 nodes) and Fugaku (Left: 20 nodes, Right: 70 nodes) supercomputers.
  • Figure 3: Speedups of our matrix-free implementation strategies for BCC molybdenum (127 and 1023 atoms with ${\sim}4k$ DoFs/atom) periodic system with pseudopotential DFT complex-valued operator (2x2x2 k-point rule) on Frontier (Left: 2 nodes, Right: 12 nodes), Param Pravega (Left: 2 nodes, Right: 10 nodes) and Fugaku (Left: 20 nodes, Right: 200 nodes) supercomputers.
  • Figure 4: Speedups of our matrix-free implementation strategies for benzamide (real operator, 16 atoms, ${\sim}240k$ DoFs/atom) and BCC lithium (complex operator, 2x2x2 k-point rule, 15 atoms, ${\sim}140k$ DoFs/atom) with all-electron DFT operator on 2 nodes of Frontier, 2 nodes of Param Pravega and 20 nodes of Fugaku supercomputers respectively.
  • Figure 5: Roofline analysis of our matrix-free implementation and the cell-matrix implementation for Al nanoparticle (147 atoms) and benzamide (16 atoms) systems with $\texttt{FEOrder} = 8$ for Real-valued DFT operator on Param Pravega supercomputer.
  • ...and 5 more figures