Unveiling the drivers of the Baryon Cycles with Interpretable Multi-step Machine Learning and Simulations

Mst Shamima Khanom, Benjamin W. Keller, Javier Ignacio Saavedra Moreno

TL;DR

This study tackles the missing baryon problem by linking the retained baryon fraction in simulated galaxies to halo and gas properties using a two-step, interpretable ML pipeline applied to IllustrisTNG100. A Random Forest identifies the five most informative features, which are then modeled with an Explainable Boosting Machine to uncover univariate and interaction-driven dependencies, yielding a high predictive accuracy ($R^2$ about 0.86–0.87) with only five features. The analysis reveals that central gas content, star-forming gas, halo mass, the radius of the rotation-curve peak, and velocity dispersion jointly govern baryon retention, with strong virial-related interactions between $M_{200}$ and $\sigma$. The results show that galaxies outside strict virial equilibrium tend to retain more baryons, and the correlation between $M_{200}$ and $\sigma$ reflects information leakage rather than independent causal drivers, informing our understanding of the baryon cycle and guiding future multisimulation investigations.

Abstract

We present a new approach for understanding how galaxies lose or retain baryons, using a pipeline of two machine learning methods applied to the IllustrisTNG100 simulation. We employ a Random Forest Regressor and an Explainable Boosting Machine (EBM) to connect the retained baryon fraction of approximately $10^5$ simulated galaxies to their properties. Random Forest models are used to filter the properties, and the five most significant are then used to train an EBM. Interaction functions identified by the EBM highlight the relationship between baryon fraction and three different galactic mass measurements, the location of the rotation curve peak, and the velocity dispersion. This interpretable machine learning-based approach provides a promising pathway for understanding the baryon cycle in galaxies.

Paper Structure

This paper contains 30 sections, 9 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: The processing pipeline used in our work. We first extracted data from the IllustrisTNG simulation, comprising 107,867 simulated galaxies and 66 features. A Random Forest model was then used to identify the 5 most important features. An EBM was trained on these features using a 75%/25% train-test split to analyze their relationships with the target variable. Finally, the univariate and bivariate functions of the EBM were explored, providing insights into galaxy properties.
  • Figure 2: Accuracy of the two ML models in predicting $f_{Bar}$: the Random Forest Regressor (left) with 66 features and the EBM (right) with 5 features. The red line marks where predictions equal the true values. Both models show a strong correlation, with $R^2$ scores of 0.897 for the Random Forest model and 0.866 for the EBM.
  • Figure 3: The univariate feature functions ($f_{i}$) for the EBM model trained to predict the baryon fractions in galaxies. From left to right, the feature functions correspond to gas mass within $R_{V_{Max}}$ ($M_{\text{gas, MaxRad}}$), star-forming gas mass ($M_{SFG}$), halo mass ($M_{200}$), radius at the maximum rotational velocity ($R_{V_{Max}}$), and velocity dispersion ($\sigma$). Light blue areas above zero indicate regions where $f_{i}>0$, while light coral areas below zero indicate regions where $f_{i}<0$. Negative values of $f_{i}$ indicate a reduction in the predicted baryon fraction relative to the baseline ($\beta = 0.0542$), not a negative baryon fraction. The total prediction remains positive when all contributions, including $\beta$, are summed.
  • Figure 4: Direct relationships between baryon fraction and each of the five most important features identified by the Random Forest model, shown as raw scatterplots.
  • Figure 5: Retained baryon fraction $f_{Bar}$ as a function of $M_{200}$. The hexbin background shows the raw galaxy distribution (log-scaled counts), the red line shows the EBM univariate function (right y-axis), and the orange line shows the median trend from binned data (left y-axis). The dashed hot pink curve shows the median of the full EBM prediction (left y-axis). The purple-shaded region marks the low-mass range not included in wright2024, while the sky-blue-shaded region highlights the high-mass regime where the number of halos is very small (380).
  • ...and 12 more figures
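
The sign convention in the Figure 3 caption follows from the additive form of an EBM. As a reminder, the standard EBM formulation with pairwise interactions can be written as below. The symbols $\hat{f}_{Bar}$, $x_i$, $f_{ij}$ and the set of selected pairs $\mathcal{P}$ are notation assumed here for illustration and are not taken from the paper's own equations.

```latex
\hat{f}_{Bar} = \beta + \sum_{i} f_i(x_i) + \sum_{(i,j) \in \mathcal{P}} f_{ij}(x_i, x_j)
```

With $\beta = 0.0542$ as quoted in the caption, a negative $f_i(x_i)$ pulls the prediction below the baseline, while the predicted baryon fraction itself stays positive as long as the full sum does.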