
STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi

Abstract

Next-generation radio astronomy surveys are producing millions of resolved sources, but robust morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for transferable radio astronomy image encoders. STRADAViT combines a mixed-survey pretraining dataset, radio astronomy-aware view generation, and controlled continued pretraining through reconstruction-only, contrastive-only, and two-stage branches. Pretraining uses 512x512 radio astronomy cutouts from MeerKAT, ASKAP, LOFAR/LoTSS, and SKA data. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks: MiraBest, LoTSS DR2, and Radio Galaxy Zoo. Relative to the initialization used for continued pretraining, the best two-stage STRADAViT models improve Macro-F1 in all reported linear-probe settings and in most fine-tuning settings, with the largest gain on RGZ DR1. Relative to strong DINOv2 baselines, gains are selective but remain positive on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2-initialized HCL ablation further shows that the adaptation recipe is not specific to a single starting point. The released STRADAViT checkpoint remains the preferred model because it offers competitive transfer at lower token count and downstream cost than the DINOv2-based alternative. These results show that radio astronomy-aware view generation and staged continued pretraining provide a stronger starting point than out-of-the-box Vision Transformers for radio astronomy transfer.


Paper Structure

This paper contains 58 sections, 17 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Reconstruction-branch masked-reconstruction views. The first column shows the standardized parent cutout (after the per-image ZScale contrast stretch described in Section \ref{ssl_data}). Subsequent columns show example ROI-aligned crops produced by our single-view strategy (Section \ref{phase1}); with probability $p_{\mathrm{global}}=0.2$ the strategy instead samples a wide-field crop from the full cutout. The selected view is then transformed by progressively applied, morphology-preserving augmentations.
  • Figure 2: Contrastive-branch multi-view examples produced by our on-the-fly augmenter. The first column shows the standardized parent cutout. Each subsequent pair of columns illustrates two correlated views sampled from the same cutout and anchored to the same object-centric ROI (Section \ref{phase2}): a wider/global view with mild, morphology-preserving augmentations and an additional view with stronger corruptions. For training we sample exactly $V=2$ views per cutout per forward pass, so each step uses one such pair per cutout; the multiple pairs shown here are independent draws for visualization. In the contrastive loss, the two views of the same cutout form the positives, while all other views in the (distributed) micro-batch act as negatives.
  • Figure 3: Example cutouts from the three evaluation datasets after preprocessing with the same pipeline used for pretraining and evaluation (Section \ref{ssl_data}). Panels show samples of each class for (a) RGZ DR1, (b) MiraBest, and (c) the LoTSS DR2 visual-classification sample of \cite{Horton2025LoTSSDR2Morphology}.
  • Figure 4: Aggregate recall-form confusion matrices (%) for MiraBest, LoTSS DR2, and RGZ DR1. Panel (a) shows the linear-probe comparison and panel (b) shows the full fine-tuning comparison. In both panels, the selected STRADAViT configuration is compared against the two starting-point baselines, ViT-MAE and DINOv2 (Registers). Rows denote true classes and columns denote predicted classes.
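The contrastive setup described in the Figure 2 caption — exactly two correlated views per cutout, with the partner view as the positive and every other view in the micro-batch as a negative — corresponds to the standard NT-Xent/InfoNCE objective. The sketch below illustrates that objective in NumPy; the function name, temperature value, and symmetric pairing are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Symmetric InfoNCE over two correlated views per cutout.

    z1, z2: (B, D) embeddings of view 1 and view 2 of the same B cutouts.
    For each anchor, its partner view is the positive; the other 2B - 2
    views in the batch act as negatives (V = 2 views per cutout).
    """
    z = np.concatenate([z1, z2], axis=0)               # (2B, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = (z @ z.T) / temperature                      # (2B, 2B) similarity logits
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    B = z1.shape[0]
    # Index of each anchor's positive: row i pairs with row i+B (and vice versa).
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    log_denom = np.log(np.exp(sim).sum(axis=1))        # log-sum-exp over all candidates
    loss = -(sim[np.arange(2 * B), pos] - log_denom)   # cross-entropy toward the positive
    return loss.mean()
```

As a sanity check, embedding pairs that agree (identical views) should yield a lower loss than unrelated pairs, since the positive logit dominates the denominator.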