NDFT: Accelerating Density Functional Theory Calculations via Hardware/Software Co-Design on Near-Data Computing System

Qingcai Jiang, Buxin Tu, Xiaoyu Hao, Junshi Chen, Hong An

TL;DR

This paper tackles the data movement bottleneck in LR-TDDFT by proposing NDFT, a near-data density functional theory framework that co-designs software and hardware on a CPU-NDP system. It introduces a cost-aware, function-level offloading strategy and a hardware/software optimization of pseudopotential handling to reduce memory usage and communication overhead. NDFT achieves speedups of up to 5.2x over a CPU baseline and 2.5x over a GPU baseline on a large physical system, while shrinking the memory footprint and improving robustness against out-of-memory (OOM) failures. Overall, NDFT scales well across system sizes and offers practical benefits for large-scale excited-state calculations in materials science and quantum chemistry.

Abstract

Linear-response time-dependent Density Functional Theory (LR-TDDFT) is a widely used method for accurately predicting the excited-state properties of physical systems. Previous works have attempted to accelerate LR-TDDFT using heterogeneous systems such as GPUs, FPGAs, and the Sunway architecture. However, a major drawback of these approaches is the constant data movement between host memory and the memory of the heterogeneous systems, which results in substantial data movement overhead. Moreover, these works focus primarily on optimizing the compute-intensive portions of LR-TDDFT, despite the fact that the calculation steps are fundamentally memory-bound. To address these challenges, we propose NDFT, a Near-Data Density Functional Theory framework. Specifically, we design a novel task partitioning and scheduling mechanism to offload each part of LR-TDDFT to the most suitable computing units within a CPU-NDP system. Additionally, we implement a hardware/software co-optimization of a critical kernel in LR-TDDFT to further enhance performance on the CPU-NDP system. Our results show that NDFT achieves performance improvements of 5.2x and 2.5x over CPU and GPU baselines, respectively, on a large physical system.
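
The idea of sending each part of a calculation to the most suitable computing unit can be illustrated with a roofline-style cost estimate. The sketch below is not the paper's scheduler. The `Device` and `Kernel` types, the `place` function, and every number in it are hypothetical values chosen for illustration. It assumes a kernel's runtime is bounded either by peak compute or by memory bandwidth, and that offloading adds a fixed transfer cost over the host link.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    peak_gflops: float  # peak compute throughput, GFLOP/s (hypothetical)
    mem_bw_gbs: float   # sustained memory bandwidth, GB/s (hypothetical)


@dataclass
class Kernel:
    name: str
    gflop: float   # total floating-point work, GFLOP
    gbytes: float  # total memory traffic, GB


def attainable_gflops(dev: Device, k: Kernel) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    intensity = k.gflop / k.gbytes  # FLOP per byte
    return min(dev.peak_gflops, dev.mem_bw_gbs * intensity)


def is_memory_bound(dev: Device, k: Kernel) -> bool:
    """A kernel is memory-bound when the bandwidth roof lies below peak compute."""
    return dev.mem_bw_gbs * (k.gflop / k.gbytes) < dev.peak_gflops


def est_time(dev: Device, k: Kernel, transfer_gb: float = 0.0,
             link_gbs: float = float("inf")) -> float:
    """Estimated seconds: compute time at the roofline plus any transfer cost."""
    return k.gflop / attainable_gflops(dev, k) + transfer_gb / link_gbs


def place(k: Kernel, cpu: Device, ndp: Device,
          transfer_gb: float, link_gbs: float) -> str:
    """Pick the device with the lower estimated cost, counting offload traffic."""
    t_cpu = est_time(cpu, k)
    t_ndp = est_time(ndp, k, transfer_gb, link_gbs)
    return ndp.name if t_ndp < t_cpu else cpu.name


# Hypothetical devices and kernels, for illustration only.
cpu = Device("CPU", peak_gflops=2000.0, mem_bw_gbs=100.0)
ndp = Device("NDP", peak_gflops=500.0, mem_bw_gbs=1000.0)
streaming = Kernel("streaming", gflop=10.0, gbytes=40.0)   # low intensity
dense = Kernel("dense", gflop=1000.0, gbytes=10.0)         # high intensity

for k in (streaming, dense):
    choice = place(k, cpu, ndp, transfer_gb=1.0, link_gbs=50.0)
    print(f"{k.name}: memory-bound on CPU = {is_memory_bound(cpu, k)}, run on {choice}")
```

Under these assumed numbers the low-intensity kernel is cheaper on the bandwidth-rich NDP side even after paying the transfer cost, while the compute-heavy kernel stays on the CPU. A real cost model would also need to account for synchronization and data layout, which this sketch ignores.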

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Computation flowchart of LR-TDDFT. $\psi_{i_{v}}(\mathbf{r})$ and $\psi_{i_{c}}(\mathbf{r})$ stand for the valence and conduction orbitals in real space ($\{\mathbf{r}_{i}\}_{i=1}^{N_{r}}$).
  • Figure 2: An example of 3D-stacked memory.
  • Figure 3: A high-level CPU-NDP architecture.
  • Figure 4: Roofline model analysis of LR-TDDFT kernels across two different system sizes.
  • Figure 5: The data structure optimization to eliminate the data redundancy of pseudopotential.
  • ...and 3 more figures