$\mbox{H}$ $\mbox{I}$ 21-cm Absorption Spectra Classification using Machine Learning

Debasish Mondal, Anirudh S. Nemmani, Arunima Banerjee

TL;DR

The paper tackles the challenge of distinguishing intervening from associated HI 21-cm absorbers in large, blind surveys without relying on optical redshifts. It uses Busy function fitting to extract physically meaningful spectral parameters from 21-cm absorption lines and trains multiple ML classifiers on them, identifying random forest as the most reliable, with up to ~89% test accuracy and an AUC of about 0.94. The linewidth parameter $w_{20}$ emerges as the key discriminator: a two-parameter model ($w_{20}$ and $\tau_{\text{int}}$) nearly matches full-parameter performance, and the Busy function fits reach parity with multi-Gaussian fits. The method shows potential for scalable absorber-type classification in SKA-era surveys, including application to the FLASH sample, where its predictions agree strongly with prior ML-based labels.
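
As an illustration of the kind of profile being fitted, the sketch below evaluates the Busy function in the form commonly quoted from Westmeier et al. (2014) and measures linewidths from it. The parameter names (`a`, `b1`, `b2`, `c`, `w`, `xe`, `xp`, `n`), the helper `width_at_fraction`, and the reading of $w_{20}$ as the full width at 20% of the peak are assumptions made for this example, not details taken from the paper; the numerical values are arbitrary.

```python
import math

def busy_function(x, a, b1, b2, c, w, xe, xp, n=2.0):
    """Busy function: two error-function flanks times a polynomial trough.

    Form assumed here: (a/4) * (erf[b1(w + x - xe)] + 1)
                             * (erf[b2(w - x + xe)] + 1)
                             * (c|x - xp|^n + 1)
    """
    rise = math.erf(b1 * (w + x - xe)) + 1.0
    fall = math.erf(b2 * (w - x + xe)) + 1.0
    trough = c * abs(x - xp) ** n + 1.0
    return (a / 4.0) * rise * fall * trough

def width_at_fraction(xs, ys, fraction):
    """Full width of a sampled profile at `fraction` of its peak value.

    Hypothetical helper: finds the outermost samples at or above the level
    and linearly interpolates the crossing on each side.
    """
    level = fraction * max(ys)
    above = [i for i, y in enumerate(ys) if y >= level]
    first, last = above[0], above[-1]

    def crossing(i, j):
        # Linear interpolation between samples i and j that straddle `level`.
        return xs[i] + (level - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i])

    left = crossing(first - 1, first) if first > 0 else xs[first]
    right = crossing(last, last + 1) if last < len(ys) - 1 else xs[last]
    return right - left

# Arbitrary symmetric profile (c = 0 removes the central trough).
xs = [-40.0 + 0.5 * i for i in range(161)]
profile = [
    busy_function(x, a=1.0, b1=0.5, b2=0.5, c=0.0, w=20.0, xe=0.0, xp=0.0)
    for x in xs
]
w50 = width_at_fraction(xs, profile, 0.5)
w20 = width_at_fraction(xs, profile, 0.2)
```

With `c = 0` and steep flanks, the half-maximum points sit at `xe ± w`, so `w50` comes out close to `2 * w`, and `w20` is slightly larger. A real fit would optimise these parameters against the observed optical-depth spectrum, which this sketch does not attempt.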

Abstract

$\mbox{H}$ $\mbox{I}$ 21-cm absorption, an extremely useful tool to study the cold atomic hydrogen gas, can arise either from intervening galaxies along the line of sight towards the background radio source or from the radio source itself. Determining whether $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines detected as part of large, blind surveys are 'intervening' or 'associated' using optical spectroscopy would be infeasible. We therefore investigate a more efficient, machine learning (ML)-based method to classify $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines. Using a sample of 118 known $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines from the literature, we train six ML models (Gaussian naive Bayes, logistic regression, decision tree, random forest, SVM and XGBoost) on the spectral parameters obtained by fitting the Busy function to the absorption spectra. We found that a random forest model trained on these spectral parameters gives the most reliable classification results, with an accuracy of 89%, an $F_1$-score of 0.9 and an AUC score of 0.94. We note that the linewidth parameter $w_{20}$ is the most significant spectral parameter that regulates the classification performance of this model. Retraining this random forest model with only this linewidth and the integrated optical depth parameters yields an accuracy of 88%, an $F_1$-score of 0.88 and an AUC score of 0.91. We have applied this retrained random forest model to predict the type of 30 new $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines detected in recent blind surveys, viz. FLASH, illustrating the potential of the techniques developed in this work for future large $\mbox{H}$ $\mbox{I}$ surveys with the Square Kilometre Array.
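
The three metrics quoted above (accuracy, $F_1$-score and AUC) can be made concrete with a small worked example. The sketch below computes them from scratch on made-up labels and scores; the data, the 0.5 decision threshold, and the choice of `1` as the positive class are all assumptions for illustration and do not come from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: 1 and 0 stand for the two absorber classes.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]  # hypothetical class-1 scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)   # 6 of 8 correct -> 0.75
f1 = f1_score(y_true, y_pred)    # TP=3, FP=1, FN=1 -> 0.75
auc = roc_auc(y_true, scores)    # 15 of 16 pairs ranked correctly -> 0.9375
```

Accuracy and $F_1$ depend on the chosen threshold, whereas AUC depends only on how the scores rank the two classes, which is why the three numbers need not move together.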

Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Busy function vs. multi-Gaussian function fit to a galaxy spectrum in our sample -- SDSS J075756.71+395936.1. The data shown in red and blue dotted lines are for the Busy function fit and the multi-Gaussian function fit, respectively. The residuals of each fit are plotted at the bottom of the plot along with the $\pm1\sigma$ lines (black dotted lines).
  • Figure 2: Histograms of the 13 spectral parameters extracted from Busy function fitting, the absorber redshift ($z_\text{abs}$) and the SNR for all the associated (red) and intervening (blue) absorbers of our data sample. The solid and dashed lines denote the median values for the associated and intervening absorbers, respectively. The two-sample KS statistic and corresponding $p$-value for each parameter are given in the legend.
  • Figure 4: Busy function vs. multi-Gaussian function fit -- the test accuracy evaluation of the random forest model over 1000 runs, where $\mu$ and $\sigma$ denote the average values of the mean and the standard deviations.
  • Figure 5: For the sample using all spectral parameters, the predictive performance of the random forest model on the test data, where in (a) $\mu$ and $\sigma$ denote the average values of the mean and the standard deviations, and in (c) error bars are shown in red.
  • Figure 6: For the redshift cut sample, the predictive performance of the random forest model on the test data, where in (a) $\mu$ and $\sigma$ denote the average values of the mean and the standard deviations, and in (c) error bars are shown in red.
  • ...and 1 more figure
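
The two-sample KS statistic mentioned in the Figure 2 caption measures the largest gap between the empirical cumulative distributions of two samples. A minimal sketch of that statistic follows; the function name `ks_statistic` and the sample values are hypothetical, and the accompanying $p$-value is not computed here.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the two empirical CDFs,
    evaluated at every value that occurs in either sample.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)  # fraction of a that is <= v
        cdf_b = bisect_right(b, v) / len(b)  # fraction of b that is <= v
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Made-up linewidth-like values for two classes.
group_one = [1.0, 2.0, 3.0, 4.0]
group_two = [3.0, 4.0, 5.0, 6.0]
d_stat = ks_statistic(group_one, group_two)  # 0.5
```

A value of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all, so a parameter with a large statistic separates the two absorber classes more cleanly.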