CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning
Ningyuan Huang, Richard Stiskalek, Jun-Young Lee, Adrian E. Bayer, Charles C. Margossian, Christian Kragh Jespersen, Lucia A. Perez, Lawrence K. Saul, Francisco Villaescusa-Navarro
TL;DR
CosmoBench tackles the challenge of applying geometric deep learning to cosmology by providing a large, multiscale, multiview benchmark built from simulations that consumed over $4.1\times10^{7}$ core-hours and generated more than 2 PB of data. It offers 34k point clouds across three length scales and 25k directed merger trees across two time scales, supporting three kinds of tasks: graph-level cosmological-parameter regression, node-level velocity prediction, and merger-tree super-resolution. Baselines span 2PCF-based ML, linear models on invariant features, and graph neural networks. Across Quijote, CAMELS-SAM, and CAMELS, the results show that simple linear models on invariant features can rival or outperform much larger ML models on large scales, while ML methods, including GNNs and higher-order graphs, can provide advantages on smaller scales and with redshift-space data. By releasing open data and code and outlining future expansions (more data, emulators, and observational realism), CosmoBench aims to catalyze collaboration between cosmology and geometric deep learning and to accelerate scientific discovery.
Abstract
Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times. Still, there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app
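
To make the idea of a least-squares fit on a handful of invariant features concrete, the following is a minimal sketch and not the authors' implementation. It assumes toy Gaussian point clouds in place of CosmoBench data, uses a normalized pairwise-distance histogram as a stand-in invariant feature, and treats the cloud's spatial scale as a hypothetical "parameter" to regress. The function names (invariant_features, fit_least_squares, predict) are illustrative only.

```python
import numpy as np

def invariant_features(points, bins):
    """Normalized histogram of pairwise distances.

    Pairwise distances are unchanged by rotations, translations, and
    reorderings of the points, so the histogram inherits those invariances.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # count each pair once
    hist, _ = np.histogram(dists[upper], bins=bins)
    return hist / len(upper[0])

def fit_least_squares(features, targets):
    """Ordinary least squares with an intercept column."""
    design = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef

def predict(features, coef):
    design = np.hstack([features, np.ones((len(features), 1))])
    return design @ coef

# Toy data (assumption): each cloud is an isotropic Gaussian whose width
# plays the role of the parameter to be inferred.
rng = np.random.default_rng(0)
bins = np.linspace(0.0, 8.0, 9)           # 8 distance bins
thetas = rng.uniform(0.5, 1.5, size=200)  # hypothetical target parameter
clouds = [rng.normal(scale=t, size=(64, 3)) for t in thetas]

features = np.stack([invariant_features(c, bins) for c in clouds])
coef = fit_least_squares(features, thetas)
train_mse = np.mean((predict(features, coef) - thetas) ** 2)
print(f"training MSE: {train_mse:.4f}  (target variance: {thetas.var():.4f})")
```

Because the features depend only on pairwise distances, the fitted model gives the same prediction for any rotated, shifted, or reordered copy of a cloud. This is the sense in which a small linear model can still respect the symmetries of the problem.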
