Unveiling the drivers of the Baryon Cycles with Interpretable Multi-step Machine Learning and Simulations
Mst Shamima Khanom, Benjamin W. Keller, Javier Ignacio Saavedra Moreno
TL;DR
This study tackles the missing baryon problem by linking the retained baryon fraction of simulated galaxies to their halo and gas properties, using a two-step, interpretable ML pipeline applied to IllustrisTNG100. A Random Forest first identifies the five most informative features; an Explainable Boosting Machine then models them to expose both single-feature and pairwise-interaction dependencies, reaching high predictive accuracy ($R^2 \approx 0.86$–$0.87$) with only those five features. The analysis indicates that central gas content, star-forming gas, halo mass, the radius of the rotation-curve peak, and velocity dispersion jointly govern baryon retention, with a strong virial-related interaction between $M_{200}$ and $\sigma$. The results show that galaxies outside strict virial equilibrium tend to retain more baryons, and that the correlation between $M_{200}$ and $\sigma$ reflects information leakage rather than independent causal drivers. These findings inform our understanding of the baryon cycle and guide future multi-simulation investigations.
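For orientation, the virial coupling between $M_{200}$ and $\sigma$ can be sketched with the standard scaling for a halo in equilibrium at fixed redshift. The symbols $R_{200}$ and $\rho_{\rm crit}$ and the dropped order-unity prefactors are not taken from the text above; they are assumptions made for this illustration.

```latex
\sigma^2 \sim \frac{G\,M_{200}}{R_{200}},
\qquad
M_{200} = \frac{4\pi}{3}\,200\,\rho_{\rm crit}\,R_{200}^{3}
\quad\Longrightarrow\quad
\sigma \propto M_{200}^{1/3}
```

Under this scaling, $\sigma$ carries little information beyond $M_{200}$ for a fully virialized system, so a departure of $\sigma$ from the $M_{200}^{1/3}$ trend at fixed halo mass is one simple way to read "outside strict virial equilibrium".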
Abstract
We present a new approach for understanding how galaxies lose or retain baryons, using a pipeline of two machine learning methods applied to the IllustrisTNG100 simulation. We employed a Random Forest Regressor and an Explainable Boosting Machine (EBM) to connect the retained baryon fraction of approximately $10^5$ simulated galaxies to their properties. The Random Forest models served as a filter: they ranked the galaxy properties, and the five most significant were used to train the EBM. Interaction functions identified by the EBM highlight the relationship between baryon fraction and three different galactic mass measurements, the location of the rotation curve peak, and the velocity dispersion. This interpretable machine learning-based approach provides a promising pathway for understanding the baryon cycle in galaxies.
