$\mbox{H}$ $\mbox{I}$ 21-cm Absorption Spectra Classification using Machine Learning
Debasish Mondal, Anirudh S. Nemmani, Arunima Banerjee
TL;DR
The paper tackles the challenge of distinguishing intervening from associated HI 21-cm absorbers in large, blind surveys without relying on optical redshifts. It fits the Busy function to 21-cm absorption lines to extract physically meaningful spectral parameters and trains multiple ML classifiers on them, identifying random forest as the most reliable, with up to ~89% test accuracy and an AUC of about 0.94. The linewidth parameter $w_{20}$ emerges as the key discriminator: two-parameter models ($w_{20}$ and $\tau_{\text{int}}$) nearly match full-parameter performance and reach parity with multi-Gaussian fits. The method shows potential for scalable absorber-type classification in SKA-era surveys. Applied to the FLASH sample, its predictions agree strongly with prior ML-based labels, highlighting the approach’s practical value for future large HI surveys.
Abstract
$\mbox{H}$ $\mbox{I}$ 21-cm absorption, an extremely useful tool for studying cold atomic hydrogen gas, can arise either from intervening galaxies along the line of sight towards the background radio source or from the radio source itself. Using optical spectroscopy to determine whether $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines detected in large, blind surveys are `intervening' or `associated' would be infeasible. We therefore investigate a more efficient, machine learning (ML)-based method to classify $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines. Using a sample of 118 known $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines from the literature, we train six ML models (Gaussian naive Bayes, logistic regression, decision tree, random forest, SVM and XGBoost) on the spectral parameters obtained by fitting the Busy function to the absorption spectra. We find that a random forest model trained on these spectral parameters gives the most reliable classification results, with an accuracy of 89%, an $F_1$-score of 0.9 and an AUC score of 0.94. We note that the linewidth parameter $w_{20}$ is the spectral parameter that most strongly regulates the classification performance of this model. Retraining the random forest model with only this linewidth and the integrated optical depth yields an accuracy of 88%, an $F_1$-score of 0.88 and an AUC score of 0.91. We apply this retrained random forest model to predict the type of 30 new $\mbox{H}$ $\mbox{I}$ 21-cm absorption lines detected in recent blind surveys, viz. FLASH, illustrating the potential of the techniques developed in this work for future large $\mbox{H}$ $\mbox{I}$ surveys with the Square Kilometre Array.
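To make the two parameters used in the retrained model concrete, the following is a minimal sketch of how a linewidth at 20% of peak optical depth ($w_{20}$) and an integrated optical depth could be measured from a sampled optical-depth profile. It is not the paper's procedure: the paper derives its parameters from Busy function fits, whereas this sketch assumes a single synthetic Gaussian component and measures both quantities directly from the samples. The function names, the velocity grid and the profile values (`tau_peak`, `sigma`) are hypothetical and chosen only for illustration.

```python
import math


def gaussian_tau(v, tau_peak, v0, sigma):
    """Optical depth of one Gaussian component at velocity v (km/s)."""
    return tau_peak * math.exp(-0.5 * ((v - v0) / sigma) ** 2)


def integrated_optical_depth(velocities, tau):
    """Trapezoidal integral of tau over velocity (result in km/s)."""
    total = 0.0
    for i in range(1, len(velocities)):
        dv = velocities[i] - velocities[i - 1]
        total += 0.5 * (tau[i] + tau[i - 1]) * dv
    return total


def width_at_fraction(velocities, tau, fraction=0.2):
    """Full width between the outermost crossings of fraction * peak tau."""
    level = fraction * max(tau)

    def crossing(below, above):
        # Linear interpolation between a sample below the level and one above it.
        t = (level - tau[below]) / (tau[above] - tau[below])
        return velocities[below] + t * (velocities[above] - velocities[below])

    at_or_above = [k for k, value in enumerate(tau) if value >= level]
    first, last = at_or_above[0], at_or_above[-1]
    left = crossing(first - 1, first) if first > 0 else velocities[first]
    right = crossing(last + 1, last) if last < len(tau) - 1 else velocities[last]
    return right - left


# Hypothetical synthetic profile: illustrative values, not from the paper.
tau_peak, v0, sigma = 0.05, 0.0, 40.0
velocities = [-300.0 + i for i in range(601)]  # 1 km/s channels
tau = [gaussian_tau(v, tau_peak, v0, sigma) for v in velocities]

w20 = width_at_fraction(velocities, tau, fraction=0.2)
tau_int = integrated_optical_depth(velocities, tau)

# Closed-form values for a single Gaussian, used as a sanity check.
w20_analytic = 2.0 * sigma * math.sqrt(2.0 * math.log(5.0))
tau_int_analytic = tau_peak * sigma * math.sqrt(2.0 * math.pi)

print(f"w20     = {w20:.2f} km/s (analytic {w20_analytic:.2f})")
print(f"tau_int = {tau_int:.3f} km/s (analytic {tau_int_analytic:.3f})")
```

For a real spectrum with noise and multiple components, a direct measurement like this becomes unstable near the 20% level, which is one reason to measure the parameters from a smooth fitted profile instead.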
