Machine Learning Frameworks for Large-Scale Radio Surveys: A Summary of Recent Studies
Nikhel Gupta
TL;DR
This work surveys the application of diverse machine learning strategies to data from the EMU radio survey, spanning supervised, unsupervised, self-supervised, and weakly supervised approaches. It introduces the RG-CAT pipeline and RadioGalaxyNET for large-scale radio galaxy detection and cataloging, with Gal-DINO achieving strong cross-modal detections of radio galaxies and infrared hosts. Self-supervised multimodal learning via OpenCLIP enables zero-shot classification and fast retrieval through the EMUSE search engine, while unsupervised methods based on self-organizing maps (SOMs) extend the discovery of rare morphologies and odd radio circles (ORCs). Weakly supervised segmentation based on class activation maps (CAMs) demonstrates effective pixel-level localization using only coarse labels. Collectively, these methods accelerate analysis pipelines, improve catalog completeness, and prepare EMU for the forthcoming SKA era by enabling rapid discovery of novel radio phenomena.
Abstract
The rapid growth of large-scale radio surveys, generating over 100 petabytes of data annually, has created a pressing need for automated data analysis methods. Recent research has explored the application of machine learning techniques to the challenges of detecting and classifying radio galaxies and of discovering peculiar radio sources. This paper provides an overview of our investigations with the Evolutionary Map of the Universe (EMU) survey, detailing the methodologies employed (supervised, unsupervised, self-supervised, and weakly supervised learning) and their implications for ongoing and future radio astronomical surveys.
