Predicting galaxy bias using machine learning

Catalina Riveros-Jara; Antonio D. Montero-Dorta; Natália V. N. Rodrigues; Pía Amigo; Natalí S. M. de Santi; Andrés Balaguera-Antolínez; Raul Abramo; Neill Guzmán; M. Celeste Artale

TL;DR

This study examines how galaxies trace the underlying matter density by predicting the per-galaxy linear bias $b_i$, computed with an object-by-object estimator for galaxies in the IllustrisTNG300 simulation and complemented with cosmic-web distances derived with DisPerSE. It compares three ML approaches (a Random Forest, a neural network, and Normalizing Flows based on neural spline flows) for predicting $b_i$ from halo properties and environment, including the $\delta_8$ overdensity and distances to cosmic-web features. Environmental features, particularly $\delta_8$, carry the most information about $b_i$, with internal halo properties playing a secondary role. Normalizing Flows outperform the deterministic models because they reproduce the full conditional distributions of $b_i$ and its joint distributions with other galaxy properties, capturing the intrinsic stochasticity of the matter-halo-galaxy connection. This probabilistic framework offers a route to measuring individual galaxy bias in upcoming spectroscopic surveys and to generating realistic mock catalogs, with possible extensions to more complex inputs, larger simulations such as MillenniumTNG, and graph-based neural networks.

Abstract

Understanding how galaxies trace the underlying matter density field is essential for characterizing the influence of the large-scale structure on galaxy formation, and is therefore a key ingredient in observational cosmology. This connection, commonly described through the galaxy bias, $b$, can be studied effectively using machine learning (ML) techniques, which offer strong predictive capabilities and can capture non-linear relationships. We aim to incorporate the linear bias parameter assigned to individual galaxies into an ML framework, quantify its dependence on various halo and environmental properties, and evaluate whether different algorithms can accurately predict this parameter and reproduce the scatter in several bias relations. We use data from the IllustrisTNG300 simulation, including the distance to different cosmic-web structures computed with DisPerSE. These data are complemented with an object-by-object estimator of the large-scale linear bias ($b_i$), which provides the individual contribution of each galaxy to the bias of the entire population. Our ML framework uses three models to predict $b_i$: a Random Forest Regressor, a Neural Network, and a probabilistic method (Normalizing Flows). We recover the full hierarchy of galaxy bias dependencies, showing that the most informative features are the overdensities, particularly $\delta_8$, followed by the distances to cosmic-web structures and selected internal halo properties, most notably $z_{1/2}$. We also demonstrate that Normalizing Flows clearly outperform deterministic methods in predicting galaxy bias, including its joint distributions with galaxy properties, owing to their ability to capture the intrinsic variance associated with the stochastic nature of the matter-halo-galaxy connection. Our ML framework provides a foundation for future efforts to measure individual bias with upcoming spectroscopic surveys.
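The contrast between deterministic and probabilistic predictions can be illustrated with a toy sketch. It is not the paper's pipeline: a linear-Gaussian model stands in for the Normalizing Flow, and the synthetic relation between a single feature and the bias (the `slope` and `scatter` values, and the names `simulate` and `fit_line`) is an assumption made purely for illustration. The point is only that a predictor returning the conditional mean compresses the scatter, whereas drawing from a fitted conditional distribution restores it.

```python
import random
import statistics

def simulate(n=20000, slope=0.8, scatter=1.5, seed=0):
    """Toy sample: bias = slope * delta + Gaussian scatter (illustrative values)."""
    rng = random.Random(seed)
    delta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    bias = [slope * d + rng.gauss(0.0, scatter) for d in delta]
    return delta, bias

def fit_line(x, y):
    """Least-squares fit of y = a * x + c; also returns the residual scatter."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    c = my - a * mx
    resid = [yi - (a * xi + c) for xi, yi in zip(x, y)]
    return a, c, statistics.pstdev(resid)

delta, bias = simulate()
a, c, sigma = fit_line(delta, bias)

# Deterministic prediction: the conditional mean only.
pred_mean = [a * d + c for d in delta]

# Stochastic prediction: one draw per object from the fitted conditional
# distribution (a Gaussian here, standing in for a learned flow).
rng = random.Random(1)
pred_draw = [a * d + c + rng.gauss(0.0, sigma) for d in delta]

std_true = statistics.pstdev(bias)
std_mean = statistics.pstdev(pred_mean)
std_draw = statistics.pstdev(pred_draw)
print(f"true {std_true:.2f}  mean-only {std_mean:.2f}  sampled {std_draw:.2f}")
```

In this toy setup the mean-only predictions have a visibly narrower distribution than the true values, while the sampled predictions recover the overall width. A Normalizing Flow generalizes the same idea to conditional distributions that need not be Gaussian.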

Paper Structure (18 sections, 7 equations, 10 figures, 7 tables)

This paper contains 18 sections, 7 equations, 10 figures, 7 tables.

Introduction
Data
IllustrisTNG
Internal halo and galaxy properties
Local environment and cosmic web
Large scale bias
Machine learning framework
Models
Evaluation metrics
Correlation between features
Predicting galaxy bias
Training procedure and model optimization
Model comparison
Individual probability distributions
Summary and conclusions
...and 3 more sections

Figures (10)

Figure 1: Correlation of each selected property with individual galaxy bias, considering the entire mass range. Each bar is accompanied by the value of the linear correlation coefficient (PCC) between the property and $b_i$.
Figure 2: Correlation coefficient between $b_i$ and each selected feature within four different halo mass bins. The bins span 0.25 dex, except for the most massive bin.
Figure 3: Performance of each model in predicting the individual galaxy bias. The top row shows the true bias values distribution (in pink), along with the predictions from each model in different colors. The NF predictions present both a random realization (NF) and the expected value across the entire sample of realizations (NF Mean). The bottom row displays scatter plots comparing the predicted values (by each model) as a function of the true values, colored by normalized density. Notably, the bottom-right plot emphasizes the use of the NF random realization. The black dashed lines represent the ideal case where predicted values match the true values.
Figure 4: Individual probability distributions (in purple) for one random galaxy in the sample obtained with the NF model trained with three different inputs. The left panel corresponds to the predicted values using $\lambda_{\text{halo}}$ as input, the center panel to the case using only $\delta_8$ as input, and the right panel to the case using the complete set of properties described in the Data section. The blue dashed line corresponds to the true individual bias value of the galaxy ($b_i = -1.19$), while the cyan distributions show the true bias values shown in pink in Fig. 3.
Figure 5: Predicted individual probability distributions obtained with the NF model when using different input properties (only $\lambda_{\text{halo}}$ in purple, only $\delta_8$ in crimson and the full set of halo and environmental properties in green). The top panel shows the distributions obtained when considering one random realization from the entire set of sampled values from the conditional probability distributions, while the bottom panel shows the case when taking the mean values of the individual probability distributions of each galaxy.
...and 5 more figures
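The binned correlation analysis described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's code: the bin edges, the sample size, and the assumed relation between the feature and $b_i$ are hypothetical, as are the helper names `pearson` and `pcc_by_mass_bin`.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def pcc_by_mass_bin(log_mass, feature, bias, edges):
    """PCC between a feature and the bias within each [lo, hi) halo-mass bin."""
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [i for i, m in enumerate(log_mass) if lo <= m < hi]
        if len(idx) < 3:
            result.append(float("nan"))  # too few objects for a meaningful PCC
            continue
        result.append(pearson([feature[i] for i in idx],
                              [bias[i] for i in idx]))
    return result

# Synthetic stand-in for the catalog (all values are illustrative).
rng = random.Random(42)
n = 4000
log_mass = [rng.uniform(11.0, 12.0) for _ in range(n)]
delta_8 = [rng.gauss(0.0, 1.0) for _ in range(n)]
b_i = [d + rng.gauss(0.0, 1.0) for d in delta_8]

edges = [11.0, 11.25, 11.5, 11.75, 12.0]  # hypothetical 0.25 dex bins
pccs = pcc_by_mass_bin(log_mass, delta_8, b_i, edges)
for lo, hi, r in zip(edges[:-1], edges[1:], pccs):
    print(f"[{lo:.2f}, {hi:.2f}) PCC = {r:.2f}")
```

Because the PCC measures only linear association, a feature with a strong but non-linear link to $b_i$ could score low here. That is one reason to complement such a ranking with model-based predictions.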
