Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models

Joan Font-Quer Roset, Devina Mohan, Anna Scaife

TL;DR

This work tackles the challenge of quantifying the intrinsic dimensionality of the high-dimensional Radio Galaxy Zoo data when labelled examples are scarce. It employs a score-based diffusion model to estimate intrinsic dimension (iD) and relates these estimates to Bayesian neural network energy scores that indicate how out-of-distribution a source is relative to the MiraBest benchmark. Compared with classical estimators (MLE, LPCA, PPCA), the diffusion-based estimator yields larger iD values. Out-of-distribution RGZ sources exhibit higher iD, no strong difference is observed between FR I and FR II sources, and SNR shows only a weak relationship with iD. The paper highlights the potential of using iD–energy score relationships to improve self-supervised representations and to guide compression strategies for RGZ subsets in future research.

Abstract

In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no clear difference in iD is found between the FR I and FR II classes, we observe a weak trend toward higher SNR at lower iD. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.

Paper Structure

This paper contains 7 sections, 2 theorems, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2.1

For any point ${\mathbf{x}} \in {\mathbb{R}}^d$ sufficiently close to a compact embedded manifold ${\mathcal{M}}$, and a sufficiently small diffusion time $t$, the score vector $\nabla_{{\mathbf{x}}} \ln(p_t ({\mathbf{x}}))$ points directly at the projection of ${\mathbf{x}}$ on the manifold.
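
The statement of Theorem 2.1 can be checked numerically on a toy example. The sketch below is an illustration only, not the paper's setup: it assumes the manifold is the unit circle in the plane, approximates the data distribution by 2000 evenly spaced points on that circle, and uses the exact score of the Gaussian-smoothed empirical distribution in place of a trained diffusion model. The function name `empirical_score`, the test point, and the value of `t` are all hypothetical choices.

```python
import numpy as np


def empirical_score(x, data, t):
    """Score of (1/N) * sum_i N(y_i, t * I) evaluated at a single point x.

    This is the gradient of the log-density of the data points `data`
    smoothed with isotropic Gaussian noise of variance t.
    """
    diffs = data - x                                  # (N, d): y_i - x
    logw = -np.sum(diffs**2, axis=1) / (2.0 * t)      # unnormalised log-weights
    w = np.exp(logw - logw.max())                     # stable softmax
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / t


# Assumed toy manifold: the unit circle, discretised into evenly spaced points.
angles = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)

x = np.array([1.2, 0.3])                 # a point close to, but off, the circle
projection = x / np.linalg.norm(x)       # nearest point on the unit circle
score = empirical_score(x, circle, t=1e-2)

# Compare the score direction with the direction from x to its projection.
to_manifold = projection - x
cosine = score @ to_manifold / (np.linalg.norm(score) * np.linalg.norm(to_manifold))
print(f"cosine similarity between score and projection direction: {cosine:.4f}")
```

A cosine similarity close to 1 indicates that, for this toy case, the score is aligned with the direction from the point to its projection on the manifold, which is the behaviour the theorem describes for small diffusion times.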

Figures (5)

  • Figure 1: Example images from the Radio Galaxy Zoo dataset taken from different intervals of the mean energy distribution.
  • Figure 2: Intrinsic dimension estimates for the RGZ dataset as a function of the (a) mean interval and (b) standard deviation interval of the energy distribution from the Hamiltonian Monte Carlo-based Bayesian neural network trained on the MiraBest dataset.
  • Figure 3: Signal-to-noise ratio versus intrinsic dimension for a subset of galaxies labelled by Fanaroff-Riley class. Both axes are shown on a logarithmic scale.
  • Figure 4: Images from the Radio Galaxy Zoo dataset from different intervals of the standard deviation of energy scores.
  • Figure 5: Score spectrum plots for up to 100 randomly selected RGZ sources from different (a) mean and (b) standard deviation intervals of the energy distribution. Here, the x-axis shows the singular value index, which runs up to the ambient dimension (72 × 72 = 5184 for RGZ images), and the y-axis shows the magnitude of each singular value. The index at which the singular values drop sharply indicates the normal dimension; subtracting it from the ambient dimension gives the intrinsic dimension.
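
The score-spectrum procedure described in the Figure 5 caption can be sketched on synthetic data. This is a minimal sketch under stated assumptions, not the paper's implementation: the trained diffusion model is replaced by the closed-form score of a standard Gaussian supported on a random `k`-dimensional linear subspace of a `d`-dimensional space, and the "sharp drop" is located as the largest difference between consecutive singular values. The names `subspace_score` and `estimate_id`, and the values of `d`, `k`, `t` and the sample count, are hypothetical.

```python
import numpy as np


def subspace_score(x, U, t):
    """Exact score of N(0, U U^T + t I), evaluated at each row of x.

    U has orthonormal columns spanning the (assumed) data subspace, so the
    covariance has eigenvalue 1 + t along the subspace and t normal to it.
    """
    tangent = x @ U @ U.T          # component of x inside the subspace
    normal = x - tangent           # component of x normal to the subspace
    return -(tangent / (1.0 + t) + normal / t)


def estimate_id(score_fn, x0, t, n_samples, rng):
    """Estimate the intrinsic dimension around x0 from the score spectrum."""
    d = x0.shape[0]
    # Perturb x0 with noise of variance t and evaluate the score at each sample.
    x = x0 + np.sqrt(t) * rng.standard_normal((n_samples, d))
    scores = score_fn(x)                                 # (n_samples, d)
    s = np.linalg.svd(scores, compute_uv=False)          # sorted, descending
    # Assumed heuristic: the normal dimension is the number of singular
    # values before the largest drop between consecutive values.
    normal_dim = int(np.argmax(s[:-1] - s[1:])) + 1
    return d - normal_dim, s


rng = np.random.default_rng(0)
d, k, t = 20, 5, 1e-4                                    # toy sizes, not RGZ
U, _ = np.linalg.qr(rng.standard_normal((d, k)))         # random k-dim subspace
x0 = U @ rng.standard_normal(k)                          # a point on the subspace

intrinsic_dim, spectrum = estimate_id(
    lambda x: subspace_score(x, U, t), x0, t, n_samples=400, rng=rng
)
print("estimated intrinsic dimension:", intrinsic_dim)
```

In this toy setting the score components normal to the subspace scale as 1/sqrt(t) while the tangential components stay of order one, so the spectrum shows `d - k` large singular values followed by a sharp drop, and the estimate should recover `k`. With a learned score on real images the drop is typically less clean, so the gap criterion used here should be treated as one possible choice.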

Theorems & Definitions (2)

  • Theorem 2.1
  • Corollary 2.1.1