CUBE2: A Parallel $N$-Body Simulation Code for Scalability, Accuracy, and Memory Efficiency

Hao-Ran Yu, Bing-Hang Chen, Kun Xu, Ming-Jie Sheng, Jiaxin Han, Yipeng Jing, Huahua Cui

TL;DR

Cube2 is a parallel cosmological $N$-body code designed for memory efficiency, accuracy, and scalability. It combines a multi-level PM/PP gravity solver, optimized Green's functions, and a fixed-point IOS data format that minimizes memory use while preserving force accuracy. The architecture pairs a global PM1 solver with local PM2/PM3 and PP corrections, built on a hierarchical cubic spatial decomposition and Coarray Fortran communication, and yields near-linear scaling on large HPC systems. Demonstrations on ACECS with $N=6144^3$ particles show close agreement with nonlinear predictions and both strong and weak scalability up to thousands of cores. The work positions Cube2 as a practical tool for massive cosmological simulations and points to future extensions, including neutrino modules and heterogeneous architectures.

Abstract

$N$-body simulation serves as a critical method for modeling cosmic evolution and represents a significant challenge in high-performance computing. We present CUBE2, a cosmological $N$-body code emphasizing memory efficiency, computational performance, scalability, and precision. The core of its algorithm uses the Particle-Mesh (PM) method to solve the Poisson equation for the matter distribution, leveraging the well-optimized Fast Fourier Transform (FFT) for computational efficiency. In terms of scalability, the multi-level PM spatial decomposition reduces the computational complexity to nearly linear. Precision is ensured by an optimized Green's function that seamlessly bridges gravitational interactions between the multi-level PM and Particle-Particle (PP) calculations. The program design enhances per-core and per-node efficiency in processing $N$-body particles, while a fixed-point data storage format addresses memory constraints for large particle counts. Using CUBE2, we run two cosmological simulations with particle counts of $6144^3$ on the Advanced Computing East China Sub-center (ACECS) to test performance and accuracy.
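
The fixed-point storage idea mentioned in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the CUBE2 data format itself: the function names `encode` and `decode`, the 8-bit width, and the explicit cell index are assumptions made for illustration. A position is stored as a small integer offset inside a coarse cell rather than as a full floating-point number, so the round-trip error is bounded by half a quantization step.

```python
def encode(x, n_cells, box=1.0, bits=8):
    """Map a position x in [0, box) to (cell index, integer offset in the cell).

    Hypothetical helper for illustration; not a CUBE2 routine.
    """
    levels = 1 << bits                       # number of representable offsets
    cell_size = box / n_cells
    cell = min(int(x / cell_size), n_cells - 1)
    frac = x / cell_size - cell              # fractional position inside the cell
    q = min(int(frac * levels), levels - 1)  # quantized offset, fits in `bits` bits
    return cell, q


def decode(cell, q, n_cells, box=1.0, bits=8):
    """Recover the position at the centre of the quantization bin."""
    levels = 1 << bits
    cell_size = box / n_cells
    return (cell + (q + 0.5) / levels) * cell_size


if __name__ == "__main__":
    n_cells, bits = 16, 8
    x = 0.123456
    cell, q = encode(x, n_cells, bits=bits)
    x_back = decode(cell, q, n_cells, bits=bits)
    max_err = (1.0 / n_cells) / (2 * (1 << bits))
    print(cell, q, abs(x_back - x) <= max_err)
```

With this layout the precision is set by the cell size and the bit width together, so refining the coarse grid improves the positional resolution without widening the stored integer.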

Paper Structure

This paper contains 16 sections, 15 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Reference force and its decomposition. Upper panel: the total reference force ($b_{\rm PP}=0.06$) is decomposed into four components, each calculated by Cube2 (discussed in the "Force accuracy" section), with shaded regions denoting force errors. Middle and lower panels: magnitude residual and directional error between the computed force sum $\bm F$ and the true reference force $\bm R$. In all three panels, inner error regions indicate the $1\sigma$ standard deviation, and outer regions indicate the minimum-to-maximum range of the distribution.
  • Figure 2: Optimized Green's function $G_2$ (upper triangle, symmetric between $k_1$ and $k_2$) and $k^{-2}$ (lower triangle) for comparison, on the $k_1k_2$-plane with $k_3=0$.
  • Figure 3: Illustration of the Cube2 volume decomposition hierarchy. In each dimension, the box contains 2 nodes (gray), each node contains 2 tiles (yellow), and each tile contains 4 subtiles (red). The color variations in subtiles indicate different PM3 grid resolutions.
  • Figure 4: Time consumption comparisons of the adaptive PM3+PP algorithm across refinement levels: PM3 (dashed curves), PP (dotted curves), and total (solid curves) for different resolutions are shown by different colors.
  • Figure 5: Schematic diagram of load balancing. Left panel: parallel processing in tile order. Right panel: parallel processing after sorting tiles by task size in descending order. The optimized approach reduces load imbalance by $11.4\%$ and achieves near-perfect load balancing at $99.6\%$.
  • ...and 3 more figures
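
The load-balancing idea in the Figure 5 caption (processing tiles in their original order versus after sorting by task size in descending order) can be sketched with a toy scheduler. This is an illustrative model, not the Cube2 implementation: the function names `makespan` and `balance`, the task sizes, and the assumption that each worker takes the next task as soon as it is free are all hypothetical.

```python
import heapq


def makespan(task_sizes, n_workers):
    """Finish time when each worker takes the next task as soon as it is free."""
    finish = [0.0] * n_workers           # current finish time of every worker
    heapq.heapify(finish)
    for size in task_sizes:
        earliest = heapq.heappop(finish)  # the worker that becomes free first
        heapq.heappush(finish, earliest + size)
    return max(finish)


def balance(task_sizes, n_workers):
    """Fraction of total worker time spent on useful work (1.0 is perfect)."""
    return sum(task_sizes) / (n_workers * makespan(task_sizes, n_workers))


if __name__ == "__main__":
    tiles = [1, 1, 1, 1, 2, 2, 8]        # toy per-tile costs, largest tile last
    in_order = balance(tiles, 2)
    sorted_desc = balance(sorted(tiles, reverse=True), 2)
    print(f"tile order: {in_order:.3f}, sorted descending: {sorted_desc:.3f}")
```

In this toy case the large tile arrives last when tiles are processed in order, leaving one worker idle while the other finishes it. Sorting in descending order starts the expensive tile first and lets the small ones fill the gaps. The toy figures are unrelated to the $11.4\%$ and $99.6\%$ values reported in the caption.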