CNN+FoF: application of deep learning to the identification of dark matter haloes

Soumadeep Maiti; Carlos M. Correa; Andrea Fiorilli; Andrés N. Ruiz; Dante J. Paz; Alejandro Pérez Fernández; Ariel G. Sánchez

TL;DR

This study offers a faster, more scalable alternative to conventional halo finders. The pipeline achieves a speed-up of approximately one order of magnitude relative to ROCKSTAR, which makes it a promising option for modern simulation-based inference methods that rely on rapid and accurate structure identification.

Abstract

We present a deep-learning-based approach for identifying dark matter haloes in cosmological N-body simulations. Our framework consists of a volumetric Convolutional Neural Network (CNN) that classifies individual simulation particles as either halo or non-halo members, followed by a highly optimised and parallelised Friends-of-Friends (FoF) clustering algorithm that groups the classified halo members into distinct haloes. The training data comprise simulations generated using GADGET-4, with labels obtained from the ROCKSTAR halo finder. Our models incorporate two main halo mass definitions, $M_{200\mathrm{b}}$ and $M_{\text{vir}}$, with similar performance. For haloes defined by the ROCKSTAR $M_{200\mathrm{b}}$ criterion, the classification network demonstrated stable performance across multiple simulation resolutions. For the highest resolution, it achieved over $98\%$ across all primary performance metrics when identifying halo particles. Furthermore, the FoF algorithm yielded halo catalogues with a purity generally exceeding $95\%$ and a stable completeness of $93\%$ for masses above $5\times10^{11} \, M_\odot$. Our pipeline recovered the centre-of-mass positions, velocities and halo masses with high fidelity, yielding a halo mass function consistent to within $5\%$ of the reference while faithfully reconstructing the internal density profiles. The primary objective of this study is to offer a faster and more scalable alternative to conventional halo finders. The pipeline achieves a speed-up of approximately one order of magnitude relative to ROCKSTAR, which makes it a promising option for modern simulation-based inference methods that rely on rapid and accurate structure identification.
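The two-stage idea described in the abstract (threshold the per-particle halo probability, then link the selected particles with FoF) can be illustrated with a minimal sketch. It is not the authors' optimised, parallelised implementation: the brute-force pair search, the function name `friends_of_friends`, the toy positions, the probabilities, the threshold of 0.5 and the linking length of 0.15 are all assumptions chosen for illustration, and periodic boundaries are ignored.

```python
from itertools import combinations
from math import dist


def friends_of_friends(positions, linking_length, min_members=2):
    """Group points connected by chains of neighbours closer than linking_length."""
    parent = list(range(len(positions)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(N^2) pair search: acceptable for a toy example only.
    for i, j in combinations(range(len(positions)), 2):
        if dist(positions[i], positions[j]) <= linking_length:
            parent[find(i)] = find(j)

    # Collect members per root and drop groups below the size cut.
    groups = {}
    for i in range(len(positions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= min_members)


# Hypothetical stand-in for the CNN output: one halo probability per particle.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
             (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (9.0, 0.0, 0.0)]
probabilities = [0.9, 0.8, 0.7, 0.95, 0.6, 0.1]
threshold = 0.5  # illustrative value, not taken from the paper

# Stage 1: keep only particles classified as halo members.
halo_idx = [i for i, p in enumerate(probabilities) if p >= threshold]
halo_pos = [positions[i] for i in halo_idx]

# Stage 2: link the selected particles into distinct haloes.
groups = friends_of_friends(halo_pos, linking_length=0.15)
haloes = [[halo_idx[k] for k in g] for g in groups]
print(haloes)
```

In this toy set-up the last particle is rejected by the classifier stage, so FoF only ever sees the five candidate halo members. A production version would typically replace the pair loop with a spatial tree or grid search.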

Paper Structure (12 sections, 6 equations, 9 figures, 9 tables)

This paper contains 12 sections, 6 equations, 9 figures, 9 tables.

Introduction
Simulation data set
Hybrid Halo-Finding Pipeline
CNN for binary classification
FoF algorithm to group distinct haloes
Results
Particle classification accuracy
Recovering halo properties
Computational performance
Conclusions and future work
Performance with the $M_{\mathrm{vir}}$ mass definition
Confusion matrix and derived metrics

Figures (9)

Figure 1: ROC curve showing the CNN's ability to distinguish halo and non-halo particles across varying probability thresholds. The red point indicates the best threshold for the classification, corresponding to $y=0.498$.
Figure 2: Spatial distribution of particles in one of the $L200$-$N128^3$ test simulations, colour-coded by classification category: true positives (green), false positives (red), false negatives (orange), and true negatives (grey). The main panel displays a projected slice (depth of $2.5\%$ of the box size) illustrating the large-scale cosmic web. The inset zooms in on a representative halo identified by ROCKSTAR, with the centre marked by a purple cross and the $r_{200\mathrm{b}}$ radius indicated by a dashed blue circle. The network successfully traces the main halo body, while misclassifications are concentrated at the halo outskirts.
Figure 3: Normalised radial distribution of particles relative to the nearest ROCKSTAR halo centre, scaled by the $r_{200\mathrm{b}}$ radius. The vertical dashed line marks the halo boundary at $r = r_{200\mathrm{b}}$. Inner regions are dominated by true positives (green), which decline at the halo radius, indicating accurate identification of gravitationally bound particles. Misclassifications (false negatives, orange; false positives, red) peak around the halo radius, reflecting the challenge of distinguishing particles located near the boundary. True negatives (grey) dominate the outer regions, although there is a significant fraction of them located inside $r_{200\mathrm{b}}$. Histogram counts are shown on a logarithmic scale.
Figure 4: Mass-dependent purity (red) and completeness (black) of the CNN+FoF halo catalogue relative to the reference ROCKSTAR sample. The completeness exceeds $90\%$ across a wide mass range, remaining stable at $\sim 93\%$ above $M = 5 \times 10^{11} \, M_{\odot}$ (grey vertical line). The drop at the low-mass end arises as we approach the resolution limit. The purity generally exceeds $95\%$, but declines towards both ends of the mass range. This decline can be attributed to FoF fragmentation at lower masses and to the merging of adjacent structures at the high-mass end.
Figure 5: Accuracy of recovered halo properties for the matched CNN+FoF and ROCKSTAR samples. Top: Distribution of centre-of-mass position offsets, $\Delta X_i$, normalised by the $r_{200\mathrm{b}}$ radius, along each Cartesian axis. Bottom: Distribution of the component-wise velocity ratios, $V_i^{\mathrm{ROCKSTAR}} / V_i^{\mathrm{CNN+FoF}}$. The position offsets are sharply peaked at zero, while the velocity ratios cluster tightly around unity, demonstrating that the pipeline recovers both the spatial and bulk dynamical properties of haloes with high fidelity.
...and 4 more figures
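
The four classification categories used in Figures 2 and 3 (true positives, false positives, false negatives and true negatives) are the entries of a binary confusion matrix. As a hedged illustration, the sketch below computes the standard metrics derived from those counts. Which metrics the paper treats as its primary ones is not stated in this summary, so the selection here (accuracy, precision, recall, F1) is an assumption, as are the function name and the made-up counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics derived from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)  # fraction of predicted halo particles that are real
    recall = tp / (tp + fn)     # fraction of real halo particles that were found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration only; not results from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because non-halo particles can dominate a simulation volume, accuracy alone may look high even for a weak classifier, which is why precision and recall on the halo class are usually reported alongside it.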
