nbodykit: an open-source, massively parallel toolkit for large-scale structure

Nick Hand, Yu Feng, Florian Beutler, Yin Li, Chirag Modi, Uros Seljak, Zachary Slepian

TL;DR

nbodykit delivers a massively parallel, open-source Python toolkit for large-scale structure analysis, unifying canonical LSS algorithms within a modular, component-based framework built atop MPI. It combines interactive, notebook-friendly usage with HPC scalability through distributed Catalogs and Meshes, along with a comprehensive set of cosmology tools, data readers, and fast estimators for power spectra and correlation functions. The framework emphasizes reproducibility, automated testing, and thorough documentation, demonstrated through an illustrative galaxy clustering emulator and extensive performance benchmarks showing strong scaling on leadership-class systems. By integrating mature external libraries (e.g., Corrfunc, Halotools, FastPM) and providing open pathways for community contributions, nbodykit aims to become a standard foundation for future LSS analyses and tools.

Abstract

We present nbodykit, an open-source, massively parallel Python toolkit for analyzing large-scale structure (LSS) data. Using Python bindings of the Message Passing Interface (MPI), we provide parallel implementations of many commonly used algorithms in LSS. nbodykit is both an interactive and scalable piece of scientific software, performing well in a supercomputing environment while still taking advantage of the interactive tools provided by the Python ecosystem. Existing functionality includes estimators of the power spectrum, 2- and 3-point correlation functions, a Friends-of-Friends grouping algorithm, mock catalog creation via the halo occupation distribution technique, and approximate N-body simulations via the FastPM scheme. The package also provides a set of distributed data containers, insulated from the algorithms themselves, that enable nbodykit to provide a unified treatment of both simulation and observational data sets. nbodykit can be easily deployed in a high-performance computing environment, overcoming some of the traditional difficulties of using Python on supercomputers. We provide performance benchmarks illustrating the scalability of the software. The modular, component-based approach of nbodykit allows researchers to easily build complex applications using its tools. The package is extensively documented at http://nbodykit.readthedocs.io, which also includes an interactive set of example recipes for new users to explore. Because nbodykit is open-source software, we hope it provides a common framework for the community to use and develop in confronting the analysis challenges of future LSS surveys.

Paper Structure

This paper contains 36 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: The components and interfaces of nbodykit. The main Python classes are Catalog, Mesh, and Algorithm objects, which are described in more detail in §\ref{sec:component-approach}. Algorithm results can be consistent, where all processes hold the same data, or distributed, where data is spread out evenly across parallel processes.
  • Figure 2: A comparison of the effects of interlacing when using the CIC, TSC, and PCS windows. We show the ratio of the power spectrum computed for a log-normal density field using a mesh with $512^3$ cells to a reference power spectrum $P^\mathrm{ref}$, computed using a mesh with $1024^3$ cells. The ratio is shown as a function of wavenumber in units of the Nyquist frequency of the lower-resolution mesh. In all cases, the appropriate window compensation is performed using equation \ref{eq:window-deconvolution}.
  • Figure 3: The performance of the Daubechies and Symlet wavelets in comparison to the CIC, TSC, and PCS windows. Wavelet windows of sizes $a=$ 6, 12, and 20 are shown. Top: the ratio of the measured power to the reference power spectrum, as in Figure \ref{fig:window-corrs}. Here, we apply no corrections when using the wavelet windows and apply equation \ref{eq:window-deconvolution} for the CIC, TSC, and PCS windows. No interlacing is used for this test. Bottom: the speed of each interpolation window, relative to the CIC window. Speeds were recorded when computing the power spectra in the top panel.
  • Figure 4: Top: an analysis pipeline illustrating the creation of a Mesh object from a Catalog, as well as how to serialize the painted mesh to disk and preview a low-resolution projection of the density field for inspection. Bottom: the two-dimensional, low-resolution preview of the painted density field $N/\langle N \rangle = 1 + \delta$.
  • Figure 5: A galaxy clustering emulator, implemented with nbodykit. Left: the source code for the application, which evolves an initial Gaussian field to $z=0$ using the FastPM simulation scheme, identifies FOF halos, populates those halos with galaxies, and records the power spectrum of each step. Right, top: the flow of data through the various components. Right, bottom: the resulting $P(k)$ measured for each step in the emulator. Performance benchmarks for this application are given in Figure \ref{fig:emu-benchmark}.
  • ...and 3 more figures
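
The captions for Figures 2 and 3 refer to a window compensation step for the CIC, TSC, and PCS windows, but the equation itself is not reproduced in this section. As a rough illustration only, the sketch below assumes the commonly used separable form, in which the Fourier-space window along each axis is $\mathrm{sinc}(k H / 2)^p$ with $p = 2, 3, 4$ for CIC, TSC, and PCS and $H$ the cell size, and compensation divides the field by that window. The function names (`window_1d`, `window_3d`, `compensate`) are hypothetical, and this is plain NumPy, not nbodykit's implementation.

```python
import numpy as np

# Assumed order p of each mass-assignment window: W(k) = sinc(k H / 2)^p per axis.
WINDOW_ORDER = {"cic": 2, "tsc": 3, "pcs": 4}


def window_1d(k, cell_size, kind="cic"):
    """Fourier-space window along one axis at wavenumber(s) k."""
    # np.sinc(x) = sin(pi x) / (pi x), so sinc(k H / 2) = np.sinc(k H / (2 pi)).
    return np.sinc(np.asarray(k) * cell_size / (2.0 * np.pi)) ** WINDOW_ORDER[kind]


def window_3d(nmesh, box_size, kind="cic"):
    """Separable 3D window on the grid layout produced by np.fft.rfftn."""
    cell_size = box_size / nmesh
    k_full = 2.0 * np.pi * np.fft.fftfreq(nmesh, d=cell_size)
    k_half = 2.0 * np.pi * np.fft.rfftfreq(nmesh, d=cell_size)
    wx = window_1d(k_full, cell_size, kind)[:, None, None]
    wy = window_1d(k_full, cell_size, kind)[None, :, None]
    wz = window_1d(k_half, cell_size, kind)[None, None, :]
    return wx * wy * wz


def compensate(delta_k, box_size, kind="cic"):
    """Divide an rfftn-ordered cubic field by the assumed window."""
    nmesh = delta_k.shape[0]
    return delta_k / window_3d(nmesh, box_size, kind)


# Usage: damp a random field by the TSC window, then undo the damping.
rng = np.random.default_rng(0)
delta_k = np.fft.rfftn(rng.standard_normal((8, 8, 8)))
smoothed = delta_k * window_3d(8, 100.0, "tsc")
recovered = compensate(smoothed, 100.0, "tsc")
```

Under this assumption the window is non-zero everywhere up to the Nyquist frequency, so the division is well defined. It only removes the smoothing from the window; it does not remove aliasing, which is what the interlacing compared in Figure 2 targets.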