Predicting galaxy bias using machine learning
Catalina Riveros-Jara, Antonio D. Montero-Dorta, Natália V. N. Rodrigues, Pía Amigo, Natalí S. M. de Santi, Andrés Balaguera-Antolínez, Raul Abramo, Neill Guzmán, M. Celeste Artale
TL;DR
This study tackles how galaxies trace the underlying matter density by predicting the per-galaxy linear bias $b_i$, obtained with an object-by-object estimator in IllustrisTNG300 and combined with DisPerSE-derived cosmic-web distances. It compares three ML approaches, namely Random Forest, a neural network, and Normalizing Flows (neural spline flows), to predict $b_i$ from halo properties and environment, including the $\delta_8$ overdensity and distances to cosmic-web features. The results show that environmental features, particularly $\delta_8$, carry the strongest information about $b_i$, with internal halo properties playing a secondary role. Importantly, Normalizing Flows outperform deterministic models by faithfully reproducing the full conditional distributions of $b_i$ and its joint distributions with other galaxy properties, capturing the intrinsic stochasticity of the matter-halo-galaxy connection. This probabilistic framework provides a pathway to measuring individual galaxy bias in upcoming spectroscopic surveys and to generating realistic mock catalogs, with potential extensions to more complex inputs, to larger simulations such as MillenniumTNG, and to graph-based neural networks for improved modeling.
Abstract
Understanding how galaxies trace the underlying matter density field is essential for characterizing the influence of the large-scale structure on galaxy formation, and is therefore a key ingredient in observational cosmology. This connection, commonly described through the galaxy bias, $b$, can be studied effectively using machine learning (ML) techniques, which offer strong predictive capabilities and can capture non-linear relationships. We aim to incorporate the linear bias parameter assigned to individual galaxies into an ML framework, quantify its dependence on various halo and environmental properties, and evaluate whether different algorithms can accurately predict this parameter and reproduce the scatter in several bias relations. We use data from the IllustrisTNG300 simulation, including the distance to different cosmic-web structures computed with DisPerSE. These data are complemented with an object-by-object estimator of the large-scale linear bias ($b_i$), which provides the individual contribution of each galaxy to the bias of the entire population. Our ML framework uses three models to predict $b_i$: a Random Forest Regressor, a Neural Network, and a probabilistic method (Normalizing Flows). We recover the full hierarchy of galaxy bias dependencies, showing that the most informative features are the overdensities, particularly $\delta_8$, followed by the distances to cosmic-web structures and selected internal halo properties, most notably $z_{1/2}$. We also demonstrate that Normalizing Flows clearly outperform deterministic methods in predicting galaxy bias, including its joint distributions with galaxy properties, owing to their ability to capture the intrinsic variance associated with the stochastic nature of the matter-halo-galaxy connection. Our ML framework provides a foundation for future efforts to measure individual bias with upcoming spectroscopic surveys.
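The contrast between deterministic and probabilistic predictors can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than taken from IllustrisTNG300, a single feature stands in for $\delta_8$, a least-squares fit stands in for the deterministic models, and a Gaussian conditional with constant scatter stands in for the Normalizing Flow. It only shows why a model that returns the conditional mean recovers the population mean of $b_i$ but compresses its scatter, whereas sampling from a conditional distribution restores it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data (assumed for illustration, NOT simulation output):
# one environmental feature and a noisy per-object target.
n = 20000
delta8 = rng.normal(0.0, 1.0, n)                         # stand-in for delta_8
b_true = 1.0 + 0.8 * delta8 + rng.normal(0.0, 2.0, n)    # stand-in for b_i

# Deterministic stand-in: least-squares fit of the conditional mean.
A = np.column_stack([np.ones(n), delta8])
coef, *_ = np.linalg.lstsq(A, b_true, rcond=None)
b_det = A @ coef

# Probabilistic stand-in: Gaussian conditional p(b | delta8) with a
# constant scatter estimated from the residuals, then one draw per object.
sigma = np.std(b_true - b_det)
b_prob = b_det + rng.normal(0.0, sigma, n)

std_true, std_det, std_prob = b_true.std(), b_det.std(), b_prob.std()
print(f"mean b: true={b_true.mean():.3f}  det={b_det.mean():.3f}  prob={b_prob.mean():.3f}")
print(f"std  b: true={std_true:.3f}  det={std_det:.3f}  prob={std_prob:.3f}")
```

In this toy setup both predictors agree on the population mean, but only the sampled predictions have a spread comparable to that of the target. A Normalizing Flow generalizes the Gaussian stand-in by learning a flexible, feature-dependent conditional distribution instead of assuming its shape.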
