Populating Galaxies Into Halos Via Machine Learning on the Simba Simulation

Pratyush Kumar Das, Romeel Davé, Weiguang Cui

TL;DR

MIG (Machine Inferred Galaxy) is an end-to-end machine-learning framework, trained on the Simba simulation, that populates dark-matter halos with galaxies. It separates centrals from satellites, classifies star-forming (SF) versus quenched (Q) systems, and trains regressors on the SF subsets to predict $M_{*}$, SFR, $M_{\mathrm{HI}}$, $M_{\mathrm{H_2}}$, and metallicity $Z$. The study shows that a fraction-based prediction approach, combined with TPOT AutoML and RF feature selection, yields high accuracy at redshifts $z=0$, $1$, and $2$, with particularly strong gains for satellite galaxies. MIG also recovers galaxy mass functions more faithfully than direct prediction does, which enables accurate predictions of baryonic tracers for large-volume HI intensity mapping. The framework offers a scalable, physically informed way to generate mock galaxy catalogs and tracers for upcoming surveys, and highlights the importance of SF/Q separation and feature selection in capturing the complex halo–galaxy connection.

Abstract

We present a machine-learning framework, Machine Inferred Galaxy (MIG), to populate dark-matter haloes with galaxies in N-body simulations. MIG predicts stellar mass ($M_\ast$), star-formation rate (SFR), atomic and molecular gas masses ($M_{\mathrm{HI}}$ and $M_{\mathrm{H_2}}$), and metallicity, and can be extended to other properties and simulations. The pipeline first separates haloes into centrals and satellites, then uses classifiers to distinguish star-forming (SF) from quenched (Q) systems, followed by regressors trained on the SF subsets for both centrals and satellites. Trained on the $(100\,h^{-1}\,\mathrm{Mpc})^3$ SIMBA galaxy-formation simulation at $z=0$, MIG achieves high accuracy for key baryonic properties (e.g. $R^2 \approx 0.9$ for $M_{\mathrm{HI}}$ of central galaxies), and remains robust at $z=1$ and $z=2$. Training on fractional quantities (e.g. $M_{\mathrm{HI}}/M_\ast$) and rescaling by predicted $M_\ast$ improves performance over direct predictions across properties and redshifts. MIG also reproduces galaxy mass distribution functions with higher fidelity, enabling accurate predictions of integrated tracers such as H I intensity maps. MIG therefore provides an efficient, physically consistent route to generate mock galaxy catalogues and baryonic tracers in large cosmological volumes for upcoming surveys.
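
The control flow described in the abstract (split into centrals and satellites, classify SF versus Q, regress only on SF) can be illustrated with a minimal Python sketch. This is not the authors' code: the names `SubgroupModels`, `populate`, and `quenched_value` are hypothetical, the lambda "models" are toy stand-ins for trained classifiers and regressors, and assigning a single constant to quenched systems is an assumption about how the Q branch is filled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]


@dataclass
class SubgroupModels:
    """Hypothetical container: one classifier and one regressor per subgroup."""
    is_star_forming: Callable[[Features], bool]  # layer 1: SF vs Q classifier
    predict_sf: Callable[[Features], float]      # layer 2: regressor trained on SF only


def populate(
    haloes: List[Tuple[bool, Features]],
    models: Dict[str, SubgroupModels],
    quenched_value: float,
) -> List[float]:
    """Route each halo through the central/satellite split, then SF/Q, then regression."""
    predictions = []
    for is_central, features in haloes:
        group = models["central" if is_central else "satellite"]
        if group.is_star_forming(features):
            predictions.append(group.predict_sf(features))
        else:
            # Assumption: quenched systems receive one constant placeholder value.
            predictions.append(quenched_value)
    return predictions


# Toy stand-ins for trained models; the thresholds and offsets are made up.
toy_models = {
    "central": SubgroupModels(lambda f: f[0] < 12.5, lambda f: f[0] - 2.0),
    "satellite": SubgroupModels(lambda f: f[0] < 11.0, lambda f: f[0] - 1.5),
}
# Each entry is (is_central, [illustrative halo feature]).
haloes = [(True, [12.0]), (True, [13.0]), (False, [10.5]), (False, [11.5])]
result = populate(haloes, toy_models, quenched_value=-99.0)
```

Keeping separate model pairs for centrals and satellites mirrors the subgroup split in the abstract; in practice each callable would wrap a fitted estimator rather than a lambda.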

Paper Structure

This paper contains 21 sections, 7 equations, 9 figures.

Figures (9)

  • Figure 1: In this flowchart, we show the workflow of MIG. Starting from the top, galaxies are separated into central and satellite subgroups. Following this, we introduce ML classifiers in layer 1 to classify each subgroup into SF and Q galaxies, based on the relevant galaxy feature. Subsequently, distinct regressors are employed in layer 2 to independently train on the SF galaxies, encompassing both centrals and satellites. The Q galaxies are assigned a constant value below the SF–Q boundary and merged with the SF predictions to yield the final results.
  • Figure 2: An illustration of the two approaches we follow to predict the target galaxy features, using $M_{HI}$ prediction as the example. The left half of the figure shows the traditional approach, where ML frameworks are trained to predict $M_{HI}$ directly. The right half shows the fraction-based approach, where we train two ML frameworks, one for $M_{fHI}$ ($=M_{HI}/M_{*}$) and one for $M_{*}$. Finally, the two predictions are multiplied to recover $M_{HI}$ (or, in general, the main target feature).
  • Figure 3: Performance metrics of the classifier layer of MIG for central (red) and satellite galaxies (blue) at $z=0$. The scores corresponding to each metric are shown inside their respective bars.
  • Figure 4: Prediction accuracy for the best-performing frameworks for fractions-based and feature-based approaches using the Simba simulation at $z=0$. The scatter plots have the true (Simba) values on the x-axis, while the y-axis shows the predicted values. (a) has ML predictions for central galaxies, and (b) has ML predictions for satellite galaxies.
  • Figure 5: Performance metrics of the classifier layer of MIG for central (red) and satellite galaxies (blue) at $z=1,2$. The scores for each metric are indicated inside the respective bars.
  • ...and 4 more figures
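
The fraction-based approach in the Figure 2 caption amounts to multiplying a predicted fraction by a predicted stellar mass. A minimal sketch of that rescaling step follows; the function names are hypothetical, the numbers are illustrative only, and the log-space variant is an assumption (the source does not state whether predictions are made in linear or logarithmic units).

```python
import math


def rescale_fraction(pred_fraction: float, pred_mstar: float) -> float:
    """Recover the target mass from a predicted fraction and a predicted M_*."""
    return pred_fraction * pred_mstar


def rescale_log(log_fraction: float, log_mstar: float) -> float:
    """Same step in log10 space (assumed variant): the product becomes a sum."""
    return log_fraction + log_mstar


# Illustrative values, not taken from the paper.
m_hi = rescale_fraction(0.5, 2.0e10)
log_m_hi = rescale_log(math.log10(0.5), math.log10(2.0e10))
```

Both forms give the same mass; the log form is only a convenience if the regressors are trained on logarithmic targets.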