CUBE2: A Parallel $N$-Body Simulation Code for Scalability, Accuracy, and Memory Efficiency
Hao-Ran Yu, Bing-Hang Chen, Kun Xu, Ming-Jie Sheng, Jiaxin Han, Yipeng Jing, Huahua Cui
TL;DR
CUBE2 is a parallel cosmological $N$-body code designed for memory efficiency, accuracy, and scalability. It achieves these goals with a multi-level PM/PP gravity solver, optimized Green's functions, and a fixed-point IOS data format that minimizes memory use while preserving force accuracy. The architecture combines global PM1 with local PM2/PM3 and PP corrections, enabled by a hierarchical cubic spatial decomposition and Coarray Fortran communication, yielding near-linear scaling on large HPC systems. Demonstrations on ACECS with $N=6144^3$ particles show close agreement with nonlinear predictions and good strong and weak scalability up to thousands of cores. The work presents CUBE2 as a practical tool for massive cosmological simulations and points to future extensions, including neutrino modules and support for heterogeneous architectures.
Abstract
$N$-body simulation is a critical method for modeling cosmic evolution and a significant challenge in high-performance computing. We present CUBE2, a cosmological $N$-body code emphasizing memory efficiency, computational performance, scalability, and precision. The core of its algorithm uses the Particle-Mesh (PM) method to solve the Poisson equation for the matter distribution, leveraging the well-optimized Fast Fourier Transform (FFT) for computational efficiency. For scalability, the multi-level PM spatial decomposition reduces the computational complexity to nearly linear. Precision is ensured by an optimized Green's function that seamlessly bridges gravitational interactions between the multi-level PM and Particle-Particle (PP) calculations. The program design enhances per-core and per-node efficiency in processing $N$-body particles, while a fixed-point data storage format addresses memory constraints for large particle counts. Using CUBE2, we run two cosmological simulations with particle counts of $6144^3$ on the Advanced Computing East China Sub-center (ACECS) to test performance and accuracy.
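To make the PM step concrete, the following is a minimal single-level sketch of solving a Poisson equation on a periodic grid with an FFT. It is not CUBE2's implementation: it uses the plain continuous Green's function $-1/k^2$ rather than the optimized Green's function described above, it omits mass assignment, force interpolation, and the multi-level decomposition, and it absorbs physical constants such as $4\pi G$ into the source term. The function name `solve_poisson_periodic` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np


def solve_poisson_periodic(delta, box_size):
    """Solve laplacian(phi) = delta on a periodic cubic grid using an FFT.

    Assumption: the continuous Green's function -1/k^2 is used here;
    this is a stand-in, not the optimized kernel of CUBE2.
    """
    n = delta.shape[0]
    spacing = box_size / n

    # Angular wavenumbers; the last axis is halved because rfftn is used.
    k_full = 2.0 * np.pi * np.fft.fftfreq(n, d=spacing)
    k_half = 2.0 * np.pi * np.fft.rfftfreq(n, d=spacing)
    kx, ky, kz = np.meshgrid(k_full, k_full, k_half, indexing="ij")

    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0           # placeholder to avoid division by zero
    green = -1.0 / k2
    green[0, 0, 0] = 0.0        # drop the mean (k = 0) mode

    # Multiply by the Green's function in Fourier space, then transform back.
    delta_k = np.fft.rfftn(delta)
    return np.fft.irfftn(green * delta_k, s=delta.shape)


# Usage: a single sine mode with k = 1, whose exact potential is -sin(x).
n = 32
box_size = 2.0 * np.pi
x = np.arange(n) * box_size / n
delta = np.sin(x)[:, None, None] * np.ones((n, n, n))
phi = solve_poisson_periodic(delta, box_size)
```

The cost of this step is dominated by the two FFTs, which is why the PM method is efficient for the long-range part of the force; short-range accuracy is what the finer PM levels and PP corrections are meant to recover.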
