Machine Learning Approaches for Classifying Star-Forming Galaxies and Active Galactic Nuclei from MIGHTEE-Detected Radio Sources in the COSMOS Field

Walter Silima, Fangxia An, Mattia Vaccari, Eslam A. Hussein, S. Randriamampandry

TL;DR

This work tackles the problem of distinguishing star-formation-dominated and accretion-dominated radio sources in the MIGHTEE-COSMOS field using supervised ML. By comparing five classifiers on 18 multi-wavelength features and optimizing feature sets, the authors find that a five-feature combination centered on the infrared-radio correlation $q_\text{IR}$, optical compactness, stellar mass, and two IRAC colours yields high $F1$-scores ($>90\%$) even with limited training data, with $k$-NN providing the best overall performance and stability. They perform extensive feature analyses (one- and two-dimensional, permutation, sequential importance, ROC curves) and demonstrate that including MIR colours is beneficial, while dimensionality reduction or feature scaling offers limited or negative gains. The results imply that ML-based SFG/AGN classification can be robustly applied to upcoming large radio surveys (e.g., SKA-era) and can operate effectively with incomplete X-ray/VLBI information, thereby facilitating rapid, scalable analyses of vast radio-continuum datasets. The study also highlights practical considerations such as feature completeness and the trade-offs between including more features versus maintaining a complete dataset.

Abstract

Radio synchrotron emission originates from both massive star formation and black hole accretion, two processes that drive galaxy evolution. Efficient classification of sources dominated by either process is therefore essential for fully exploiting deep, wide-field extragalactic radio continuum surveys. In this study, we implement, optimize, and compare five widely used supervised machine-learning (ML) algorithms to classify radio sources detected in the MeerKAT International GHz Tiered Extragalactic Exploration (MIGHTEE)-COSMOS survey as star-forming galaxies (SFGs) and active galactic nuclei (AGN). Training and test sets are constructed from conventionally classified MIGHTEE-COSMOS sources, and 18 physical parameters of the MIGHTEE-detected sources are evaluated as input features. As anticipated, our feature analyses rank the five parameters used in conventional classification as the most effective: the infrared-radio correlation parameter ($q_\mathrm{IR}$), the optical compactness morphology parameter (class$\_$star), stellar mass, and two combined mid-infrared colors. By optimizing the ML models with these selected features and testing classifiers across various feature combinations, we find that model performance generally improves as additional features are incorporated. Overall, all five algorithms yield an $F1$-score (the harmonic mean of precision and recall) $>90\%$ even when trained on only $20\%$ of the dataset. Among them, the distance-based $k$-nearest neighbors classifier demonstrates the highest accuracy and stability, establishing it as a robust and effective method for classifying SFGs and AGN in upcoming large radio continuum surveys.
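As a small illustration of the metric quoted in the abstract, the sketch below computes the $F1$-score as the harmonic mean of precision and recall for a binary SFG/AGN labelling. The function name `f1_score`, the choice of AGN as the positive class, and the eight toy labels are assumptions made for this example; they are not taken from the paper or its data.

```python
def f1_score(y_true, y_pred, positive="AGN"):
    """F1 = harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, for illustration only).
y_true = ["AGN", "AGN", "AGN", "SFG", "SFG", "SFG", "SFG", "AGN"]
y_pred = ["AGN", "AGN", "SFG", "SFG", "SFG", "AGN", "SFG", "AGN"]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Here three AGN are recovered, one is missed, and one SFG is mislabelled as AGN, so precision and recall are both 3/4 and the $F1$-score is 0.75.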

Paper Structure

This paper contains 31 sections, 4 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: The bar plot illustrates the completeness of the overall classification (total) and the completeness for each diagnostic method of MIGHTEE-COSMOS detected radio sources. The categories of sources are colour-coded, with AGN shown in blue, SFGs in red, probable SFGs in light red, radio quiet (RQ) in light blue, non-AGN in yellow, and unclassified sources in grey.
  • Figure 2: Histograms (top) and Kolmogorov–Smirnov (K-S) test results (bottom) for AGN (blue) and SFGs (red), in the MIGHTEE-COSMOS catalog. Among the 18 parameters considered for selecting input features for ML, these six exhibit the highest significance based on the K-S statistic. The K-S value for each feature is displayed in the bottom-right corner of each panel. The features are sorted by the significance level of the K-S statistic (from left to right, top to bottom). In the top panels, vertical dashed lines indicate the mean of each distribution. In the bottom panels, the Y-axis shows the cumulative distribution function (CDF), with vertical dashed lines marking the point of maximum separation between the two distributions.
  • Figure 3: Feature correlation plots for pairs of features selected for classifying SFGs and AGN among the MIGHTEE-COSMOS radio sources. One panel (labelled fig:4.3e in the source) shows the two-dimensional feature space generated by t-SNE. The open red circles represent SFGs, while the blue dots represent AGN. The solid red ellipses outline the 95% confidence region for SFGs, while the dashed blue ellipses outline the 95% confidence region for AGN. The orientation and shape of each ellipse represent the strength and direction of the correlation between the paired features and the corresponding galaxy classifications.
  • Figure 4: Feature importance estimated by the Permutation (left) and RF (right) models. The importance in the Permutation model is derived from the mean scores based on 1000 permutations. For the RF model, importance is computed by measuring the reduction in impurity within a decision tree node when a specific feature is used to split the data. The evaluation metric used is the $F1$-score.
  • Figure 5: Receiver Operating Characteristic (ROC) curves for the five selected features, namely, the $q_\mathrm{IR}$ (blue), class$\_$star (green), log$(M_{\rm star})$ (red), log($S_{8.0}$/$S_{4.5}$) (grey), and log($S_{5.8}$/$S_{3.6}$) (yellow).
  • ...and 11 more figures
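
The Figure 2 caption describes a two-sample Kolmogorov–Smirnov comparison between the AGN and SFG distributions of each feature, with the point of maximum separation between the two CDFs marked. A minimal sketch of that statistic follows, assuming the standard definition (the largest absolute difference between the two empirical CDFs). The function name `ks_statistic` and the sample values are hypothetical and are not drawn from the MIGHTEE-COSMOS catalog.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return (D, x) where D is the maximum absolute difference between the
    two empirical CDFs and x is the value at which it first occurs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d_max, x_at_max = 0.0, None
    # The empirical CDFs only change at observed values, so check those.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a with value <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b with value <= x
        d = abs(cdf_a - cdf_b)
        if d > d_max:
            d_max, x_at_max = d, x
    return d_max, x_at_max


# Hypothetical feature values for two classes (illustration only).
sfg_values = [2.4, 2.5, 2.6, 2.7]
agn_values = [1.5, 1.8, 2.0, 2.5]

d, x = ks_statistic(sfg_values, agn_values)
print(f"K-S statistic D = {d:.2f} at x = {x}")
```

For these toy samples the CDFs are furthest apart at x = 2.0, where three of the four second-sample values but none of the first-sample values have been reached, giving D = 0.75. A larger D indicates a feature whose distributions differ more between the two classes.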