Exploring Machine Learning Regression Models for Advancing Foreground Mitigation and Global 21cm Signal Parameter Extraction

Anshuman Tripathi, Abhirup Datta, Gursharanjit Kaur

TL;DR

This study benchmarks four regression models—Gaussian Process Regression, Random Forest Regression, Support Vector Regression, and Artificial Neural Networks—for extracting global 21cm-signal parameters and mitigating foreground contamination. Using tanh-based ARES signal simulations and a log-log polynomial foreground model, it assesses performance across ideal and foreground-dominated conditions, with and without PCA preprocessing, and over varying dataset sizes. The results show that ANN consistently delivers the best accuracy and scaling, especially when foregrounds are present, while GPR offers uncertainty quantification at high computational cost; SVR and RFR generally underperform in large, realistic datasets. PCA preprocessing markedly improves all models, highlighting a practical pathway to robust, efficient parameter extraction for global 21cm studies.
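
As a rough illustration of the log-log polynomial foreground model mentioned above, the sketch below evaluates a brightness temperature whose base-10 logarithm is a polynomial in the logarithm of frequency. The functional form, the reference frequency of 80 MHz, the 50 to 200 MHz band, and the coefficient values are all assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def loglog_foreground(freq_mhz, coeffs, ref_freq_mhz=80.0):
    """Foreground brightness temperature (K) from a polynomial in log-log space.

    Assumed form: log10(T) = sum_i coeffs[i] * log10(freq / ref_freq) ** i.
    The reference frequency is a hypothetical choice.
    """
    x = math.log10(freq_mhz / ref_freq_mhz)
    log_t = sum(a * x**i for i, a in enumerate(coeffs))
    return 10.0 ** log_t


# Hypothetical coefficients: about 2000 K at 80 MHz with a spectral index of -2.5.
coeffs = [math.log10(2000.0), -2.5, 0.0]

# Assumed band: 50 to 200 MHz in 10 MHz steps.
freqs_mhz = [50.0 + 10.0 * k for k in range(16)]
foreground = [loglog_foreground(nu, coeffs) for nu in freqs_mhz]
```

With these illustrative coefficients the foreground reaches thousands of kelvin at the low end of the band, which is why a smooth, low-order model of this kind is typically fitted or projected out before the much fainter signal can be recovered.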

Abstract

Extracting parameters from the global 21cm signal is crucial for understanding the early Universe. However, detecting the 21cm signal is challenging because the foregrounds are far brighter than the signal and because of associated observational difficulties. In this study, we evaluate the performance of several machine-learning regression models for parameter extraction and foreground removal. This evaluation is essential for selecting the most suitable machine-learning regression model based on computational efficiency and predictive accuracy. We compare four models: Random Forest Regressor (RFR), Gaussian Process Regressor (GPR), Support Vector Regressor (SVR), and Artificial Neural Networks (ANN). The comparison is based on metrics such as the root mean square error (RMSE) and $R^2$ scores. We examine their effectiveness across different dataset sizes and conditions, including scenarios with foreground contamination. Our results indicate that ANN consistently outperforms the other models, achieving the lowest RMSE and the highest $R^2$ scores across multiple cases. While GPR also performs well, it is computationally intensive, requiring significant RAM and longer execution times. SVR struggles with large datasets because of its high computational cost, and RFR shows the weakest accuracy among the models tested. We also find that employing Principal Component Analysis (PCA) as a preprocessing step significantly enhances model performance, especially in the presence of foregrounds.
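
The two metrics named in the abstract can be sketched in a few lines. The example below is a minimal, standard-library illustration of RMSE and $R^2$ for a single parameter, plus an average over parameters. The averaging step, the parameter names, and the numerical values are assumptions made up for illustration and do not come from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_scores(true_by_param, pred_by_param):
    """Average the per-parameter RMSE and R^2 (the averaging is an assumption)."""
    names = list(true_by_param)
    mean_rmse = sum(rmse(true_by_param[k], pred_by_param[k]) for k in names) / len(names)
    mean_r2 = sum(r2_score(true_by_param[k], pred_by_param[k]) for k in names) / len(names)
    return mean_rmse, mean_r2


# Made-up values for two hypothetical, normalised signal parameters.
truth = {"param_a": [0.1, 0.4, 0.7, 0.9], "param_b": [0.2, 0.3, 0.6, 0.8]}
pred = {"param_a": [0.1, 0.4, 0.7, 0.9], "param_b": [0.3, 0.2, 0.7, 0.7]}
mean_rmse, mean_r2 = average_scores(truth, pred)
print(f"average RMSE = {mean_rmse:.3f}, average R^2 = {mean_r2:.3f}")
```

A lower RMSE and an $R^2$ closer to 1 both indicate better predictions. An $R^2$ of 0 corresponds to a model that does no better than always predicting the mean of the true values.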

Paper Structure

This paper contains 22 sections, 17 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Architecture of the ANN used for parameter estimation. Each circle represents a neuron, which is fully connected to the neurons in the next layer. The first layer is the input layer, the final layer is the output layer, and the intermediate layers are hidden layers activated by their respective activation functions.
  • Figure 2: Training dataset for the global 21-cm signal. The signal subsets are highlighted in red, while the remaining samples are shown in blue as background.
  • Figure 3: Training datasets for the global 21-cm signal, including added foregrounds and thermal noise. A subset of the datasets is highlighted in red, while the remaining samples are shown in blue as background.
  • Figure 4: The scatter plots show the predicted signal parameter values obtained using different machine learning models (RFR, GPR, SVR, ANN) trained on 10,000 signal-only datasets. In each plot, blue points represent ANN predictions, magenta points correspond to GPR, green points to SVR, and orange points to RFR. The solid black line represents the true parameter values, serving as a reference for model accuracy.
  • Figure 5: Comparison of average RMSE and $R^2$ scores across different machine learning models (RFR, GPR, SVR, ANN) for various dataset configurations. The left panel displays the RMSE scores, while the right panel presents the $R^2$ scores. The models are evaluated on datasets of different sizes: signal-only datasets (1,000 and 10,000 samples) and signal-with-foreground datasets (10,000 and 50,000 samples). Additionally, the performance of each model is compared when trained on raw signal-with-foreground data versus data processed with PCA.
  • ...and 3 more figures