Machine Learning Approaches for Classifying Star-Forming Galaxies and Active Galactic Nuclei from MIGHTEE-Detected Radio Sources in the COSMOS Field
Walter Silima, Fangxia An, Mattia Vaccari, Eslam A. Hussein, S. Randriamampandry
TL;DR
This work tackles the problem of distinguishing star-formation-dominated and accretion-dominated radio sources in the MIGHTEE-COSMOS field using supervised ML. By comparing five classifiers on 18 multi-wavelength features and optimizing feature sets, the authors find that a five-feature combination of the infrared-radio correlation parameter $q_\mathrm{IR}$, optical compactness, stellar mass, and two IRAC colors yields high $F1$-scores ($>90\%$) even with limited training data, with $k$-NN providing the best overall performance and stability. They perform extensive feature analyses (one- and two-dimensional, permutation, sequential importance, ROC curves) and demonstrate that including MIR colors is beneficial, while dimensionality reduction or feature scaling offers limited or negative gains. The results imply that ML-based SFG/AGN classification can be robustly applied to upcoming large radio surveys (e.g., SKA-era) and can operate effectively with incomplete X-ray/VLBI information, thereby facilitating rapid, scalable analyses of vast radio-continuum datasets. The study also highlights practical considerations such as feature completeness and the trade-off between including more features and maintaining a complete dataset.
Abstract
Radio synchrotron emission originates from both massive star formation and black hole accretion, two processes that drive galaxy evolution. Efficient classification of sources dominated by either process is therefore essential for fully exploiting deep, wide-field extragalactic radio continuum surveys. In this study, we implement, optimize, and compare five widely used supervised machine-learning (ML) algorithms to classify radio sources detected in the MeerKAT International GHz Tiered Extragalactic Exploration (MIGHTEE)-COSMOS survey as star-forming galaxies (SFGs) and active galactic nuclei (AGN). Training and test sets are constructed from conventionally classified MIGHTEE-COSMOS sources, and 18 physical parameters of the MIGHTEE-detected sources are evaluated as input features. As anticipated, our feature analyses rank the five parameters used in conventional classification as the most effective: the infrared-radio correlation parameter ($q_\mathrm{IR}$), the optical compactness morphology parameter (class$\_$star), stellar mass, and two combined mid-infrared colors. By optimizing the ML models with these selected features and testing classifiers across various feature combinations, we find that model performance generally improves as additional features are incorporated. Overall, all five algorithms yield an $F1$-score (the harmonic mean of precision and recall) $>90\%$ even when trained on only $20\%$ of the dataset. Among them, the distance-based $k$-nearest neighbors classifier demonstrates the highest accuracy and stability, establishing it as a robust and effective method for classifying SFGs and AGN in upcoming large radio continuum surveys.
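To make the evaluation described above concrete, the following is a minimal, illustrative sketch of the $F1$-score, $F1 = 2PR/(P+R)$ with precision $P$ and recall $R$, and of a distance-based $k$-nearest neighbors vote trained on $20\%$ of a sample. It is not the authors' pipeline. The data are synthetic two-dimensional points, not MIGHTEE-COSMOS measurements, and the function names (`f1_score`, `knn_predict`, `make_toy_sample`), the label convention (1 for AGN, 0 for SFG), the choice of AGN as the positive class, and $k=5$ are all assumptions made for illustration.

```python
import math
import random


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def make_toy_sample(n_per_class=200, seed=0):
    """Synthetic stand-in data: two Gaussian clusters in a 2-D feature space.

    Hypothetical labels: 0 = SFG, 1 = AGN. These are NOT real survey data.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, (0.0, 0.0)), (1, (4.0, 4.0))):
        for _ in range(n_per_class):
            X.append((rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)))
            y.append(label)
    order = list(range(len(X)))
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


# Train on 20% of the toy sample and score the remaining 80%.
X, y = make_toy_sample()
n_train = int(0.2 * len(X))
train_X, train_y = X[:n_train], y[:n_train]
test_X, test_y = X[n_train:], y[n_train:]
pred_y = [knn_predict(train_X, train_y, x, k=5) for x in test_X]
toy_f1 = f1_score(test_y, pred_y, positive=1)
print(f"toy F1 with 20% training fraction: {toy_f1:.3f}")
```

The score printed here reflects only the separation of the synthetic clusters and says nothing about the survey results. In practice, a library implementation such as scikit-learn's `KNeighborsClassifier` would likely replace the hand-written vote.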
