Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models
Joan Font-Quer Roset, Devina Mohan, Anna Scaife
TL;DR
This work tackles the challenge of quantifying the intrinsic dimensionality of the high-dimensional Radio Galaxy Zoo (RGZ) data when labelled examples are scarce. It employs a score-based diffusion model to estimate the intrinsic dimension (iD) and relates these estimates to Bayesian neural network energy scores, which indicate how out-of-distribution a source is relative to the MiraBest benchmark. The diffusion-based estimator yields larger iD values than classical estimators (MLE, LPCA, PPCA). Out-of-distribution RGZ sources exhibit higher iD, no strong difference is observed between FR I and FR II sources, and SNR shows only a weak relationship with iD. The paper highlights the potential of using the relationship between iD and energy scores to improve self-supervised representations and to guide compression strategies for RGZ subsets in future research.
Abstract
In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds the values typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no difference in iD is found between the FR I and FR II classes, we observe a weak trend toward higher SNR at lower iD. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
