A targeted machine learning approach for detecting diffuse radio emission with Astronomaly: Protege

Verlon Etsebeth, Michelle Lochner, Konstantinos Kolokythas, Kenda Knowles, Emma Tolley

TL;DR

This work shows that Protege can identify diffuse emission with minimal human labelling effort, offering a powerful, scalable tool capable of detecting both known and novel diffuse radio sources.

Abstract

Diffuse radio emission in galaxy clusters, such as radio halos, relics, and mini halos, is a key tracer of non-thermal processes, turbulence, and magnetic fields within the intra-cluster medium. However, the low surface brightness of these sources, together with contamination from compact sources and imaging artefacts, makes them challenging to detect. The sheer volume of data from instruments such as the Square Kilometre Array will render traditional detection methods based on manual inspection infeasible. This paper introduces a novel machine learning approach that uses active learning to rapidly identify diffuse emission candidates from a small, optimally selected subset of data. We apply the self-supervised deep learning algorithm Bootstrap Your Own Latent to extract features from source cutouts in the MeerKAT Galaxy Cluster Legacy Survey (MGCLS). We then pass these features through the Astronomaly: Protege anomaly detection framework to identify the final candidates. Using a human-labelled set, we evaluate our pipeline on high-resolution (~7''), convolved (15''), and combined-feature MGCLS datasets. Interestingly, the high-resolution features identify diffuse sources more efficiently than the convolved-resolution features, which are in turn outperformed by the combined features. Of the top 100 sources ranked by Protege, 99% exhibit diffuse characteristics, with 55% confirmed as cluster-related emission. Our work shows that Protege can identify diffuse emission with minimal human labelling effort, offering a powerful, scalable tool capable of detecting both known and novel diffuse radio sources.

Paper Structure (29 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Examples of some of the known diffuse cluster emission in this study. From left to right: relic, halo, mini halo, and phoenix, ordered according to their frequency in the MGCLS catalogue. The beam solid angle is shown in the bottom-left corner of each cutout.
  • Figure 2: Comparison of data cuts applied to MGCLS sources. Left: Gaussian component-based cut from lochner2024. Right: beam-size cut proposed here. The plots show the number of sources (solid lines) and tracers (dashed lines) remaining in the high-resolution (orange) and convolved (blue) datasets as a function of the cut threshold (number of Gaussian components or beam-size multiples). The vertical lines mark the selected thresholds. The beam-size cut retains more tracers for similar subset sizes.
  • Figure 3: Top 100 ranked sources from the concatenated high-resolution subset (see the subsection "High Resolution versus Convolved"). Known tracers are highlighted with blue frames. The beam solid angle is shown in the bottom-left corner of each cutout. 99 of the 100 sources represent some form of diffuse and/or extended radio emission, highlighting the performance of the algorithm. The other source corresponds to an artefact. A more detailed analysis of these sources is presented in the subsection "Top 100 Investigation".
  • Figure 4: Cumulative anomaly curves for the high-resolution dataset (on the left with 121 tracers in total) and the convolved dataset (on the right with 119 tracers in total), showing the cumulative number of tracers recovered within the ranked list for different labelling iterations. The "$0$ labels" curve corresponds to the initial distribution of tracers prior to labelling. Both datasets show significant improvements in the cumulative number of tracers with iterative labelling, although diminishing returns are observed in later iterations.
  • Figure 5: UMAP visualisations of the high-resolution (left) and convolved (right) feature spaces. Red points indicate the tracers and grey points represent all other sources. In both cases, the tracers form several groupings, with some sources dispersed across the feature space, presenting a challenge for detection with machine learning.
  • ...and 6 more figures
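
The cumulative anomaly curves described in the Figure 4 caption count how many known tracers have been recovered by each position in the ranked list. The sketch below illustrates that calculation, along with the fraction of tracers in the top N sources. It is not the authors' code: the function names, the score array, and the toy data are hypothetical, and it assumes each source has a single numeric score, with higher scores ranked first.

```python
import numpy as np


def cumulative_tracer_curve(scores, is_tracer):
    """Cumulative number of tracers recovered at each rank.

    Sources are ranked by descending score (assumed: higher = more interesting).
    Entry k-1 of the result is the number of tracers among the top k sources.
    """
    scores = np.asarray(scores, dtype=float)
    is_tracer = np.asarray(is_tracer, dtype=bool)
    if scores.shape != is_tracer.shape:
        raise ValueError("scores and is_tracer must have the same shape")
    # Stable sort so that tied scores keep their original relative order.
    order = np.argsort(-scores, kind="stable")
    return np.cumsum(is_tracer[order])


def top_n_fraction(scores, is_tracer, n):
    """Fraction of the top-n ranked sources that are tracers."""
    if n < 1:
        raise ValueError("n must be at least 1")
    curve = cumulative_tracer_curve(scores, is_tracer)
    n = min(n, len(curve))
    return curve[n - 1] / n


# Toy data (hypothetical): five sources, three of which are known tracers.
scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
is_tracer = np.array([True, False, False, True, True])

curve = cumulative_tracer_curve(scores, is_tracer)
fraction_top3 = top_n_fraction(scores, is_tracer, 3)
```

Computing the curve once per labelling iteration and overplotting the results would give a comparison of the kind the caption describes, where a curve that rises faster means tracers sit nearer the top of the ranked list.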