Exploring Machine Learning Regression Models for Advancing Foreground Mitigation and Global 21cm Signal Parameter Extraction
Anshuman Tripathi, Abhirup Datta, Gursharanjit Kaur
TL;DR
This study benchmarks four regression models (Gaussian Process Regression, Random Forest Regression, Support Vector Regression, and Artificial Neural Networks) for extracting global 21cm signal parameters and mitigating foreground contamination. Using tanh-based ARES signal simulations and a log-log polynomial foreground model, it assesses performance under ideal and foreground-dominated conditions, with and without PCA preprocessing, and over varying dataset sizes. The results show that ANN consistently delivers the best accuracy and scaling, especially when foregrounds are present, while GPR offers uncertainty quantification at high computational cost; SVR and RFR generally underperform on large, realistic datasets. PCA preprocessing markedly improves all models, pointing to a practical route to robust, efficient parameter extraction for global 21cm studies.
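As a rough illustration of what a log-log polynomial foreground model looks like, the sketch below evaluates ln T = Σ a_i [ln(ν/ν0)]^i at a few frequencies. The function name, the reference frequency of 80 MHz, and the coefficient values are illustrative assumptions, not the values used in the study.

```python
import math

def foreground_temperature(freq_mhz, coeffs, ref_freq_mhz=80.0):
    """Evaluate a log-log polynomial foreground, in kelvin.

    ln T = sum_i coeffs[i] * (ln(freq / ref_freq))**i
    The reference frequency and coefficients are hypothetical choices.
    """
    x = math.log(freq_mhz / ref_freq_mhz)
    log_temp = sum(a * x**i for i, a in enumerate(coeffs))
    return math.exp(log_temp)

# Illustrative coefficients: about 1000 K at the reference frequency,
# a spectral index near -2.5, and a small curvature term.
coeffs = [math.log(1000.0), -2.5, -0.1]
freqs = [50.0, 80.0, 110.0]
temps = [foreground_temperature(f, coeffs) for f in freqs]

for f, t in zip(freqs, temps):
    print(f"{f:6.1f} MHz -> {t:10.2f} K")
```

With a negative leading slope the foreground falls steeply with frequency, so it is brightest at the low-frequency end of the band.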
Abstract
Extracting parameters from the global 21cm signal is crucial for understanding the early Universe. However, detecting the 21cm signal is challenging because the foregrounds are much brighter than the signal and because of associated observational difficulties. In this study, we evaluate the performance of several machine-learning regression models for parameter extraction and foreground removal. This evaluation is essential for selecting the most suitable regression model on the basis of computational efficiency and predictive accuracy. We compare four models: Random Forest Regressor (RFR), Gaussian Process Regressor (GPR), Support Vector Regressor (SVR), and Artificial Neural Networks (ANN). The comparison is based on metrics such as the root mean square error (RMSE) and $R^2$ scores. We examine their effectiveness across different dataset sizes and conditions, including scenarios with foreground contamination. Our results indicate that ANN consistently outperforms the other models, achieving the lowest RMSE and the highest $R^2$ scores across multiple cases. While GPR also performs well, it is computationally intensive, requiring significant RAM and longer execution times. SVR struggles with large datasets due to its high computational cost, and RFR shows the weakest accuracy among the models tested. We also find that employing Principal Component Analysis (PCA) as a preprocessing step significantly enhances model performance, especially in the presence of foregrounds.
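The two comparison metrics named above can be written out in a few lines. This is a minimal sketch using the standard definitions of RMSE and $R^2$; the function names and the sample values are hypothetical and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    squared_errors = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_errors / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted values of one signal parameter.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print("RMSE:", rmse(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```

A lower RMSE and an $R^2$ closer to 1 both indicate better predictions. RMSE carries the units of the parameter, whereas $R^2$ is dimensionless, which makes it easier to compare across parameters with different scales.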
