astromorph: Self-supervised machine learning pipeline for astronomical morphology analysis

Per Bjerkeli, Jouni Kainulainen, Maria Carmen Toribio, Leon Boschman, Otoniel Maya Lucas

TL;DR

Astromorph addresses the need to organize and interpret large, imaging-rich astronomical datasets without labeled data by implementing a self-supervised BYOL-based pipeline tailored for astronomy. It integrates a BYOL framework into a user-friendly package that supports variable data dimensions, including single-channel FITS images and multi-channel spectral cubes, and provides a lightweight CNN for accessible training. The paper demonstrates two science cases—ALMA protoplanetary disks and infrared dark clouds from Spitzer/Herschel—showing that the learned embeddings capture morphology and enable clustering, similarity search, and exploratory analysis. The approach offers a practical, scalable tool for discovery in observational astronomy and is poised to extend to JWST data and 3D data cubes.

Abstract

Modern telescopes generate increasingly large and diverse datasets, often consisting of complex and morphologically rich structures. Efficiently exploring such data requires automated methods that can extract and organize physically meaningful information, ideally without the need for extensive manual interaction. We aim to provide a user-friendly implementation of a self-supervised machine learning framework to explore morphological properties of large datasets, based on the BYOL (Bootstrap Your Own Latent) method. By generating meaningful image embeddings without manually labelled data, the framework enables key tasks such as clustering, anomaly detection, and similarity-based exploration. In contrast to existing BYOL implementations, astromorph accommodates data of varying dimensions and resolutions, including both single-channel FITS images and multi-channel spectral cubes. The package is built with usability in mind, offering streamlined pipeline scripts for ease of use as well as deeper customization options via PyTorch-based classes. To demonstrate the utility of astromorph, we apply it in two contrasting science cases representing different astronomical domains: images of protoplanetary disks observed with ALMA, and infrared dark clouds observed with Spitzer and Herschel. In both cases, we demonstrate how astromorph produces scientifically meaningful embeddings that capture morphological differences and similarities across large samples. astromorph enables users to apply a robust, label-free approach for uncovering morphological patterns in astronomical datasets. The successful application to two markedly different datasets suggests that the pipeline is broadly applicable across a wide range of imaging-rich astronomical contexts, providing a user-friendly tool for advancing discovery in observational astronomy.
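For readers unfamiliar with BYOL, the two ingredients that let it learn without labels can be sketched in a few lines. This is a minimal illustration of the generic BYOL objective and target-network update as published for the method, not astromorph's own code: the function names `l2_normalize`, `byol_loss`, and `ema_update`, the plain-list vectors, and the default `tau` are all assumptions made for the example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss between two views: 2 - 2 * cosine similarity.

    Ranges from 0 (aligned vectors) to 4 (opposite vectors).
    """
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(q, z))


def ema_update(target_weights, online_weights, tau=0.99):
    """Exponential moving average update of the target network.

    Each target weight moves a fraction (1 - tau) toward the online weight.
    The value of tau here is illustrative only.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]
```

In a real training loop the two input vectors would come from two augmented views of the same image passed through the online and target networks, and the loss would typically be symmetrized by swapping the views.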

Paper Structure (28 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: Process diagram of the astromorph package. The pipeline_training.py script builds and trains a CNN model using the BYOL framework, with configuration defined in example_settings.toml. The BYOL architecture (upper right) operates on two augmented views of the input to learn meaningful representations without labels. The script pipeline_inference.py uses the trained model to compute and store embeddings that can be used for further scientific exploration.
  • Figure 2: t-SNE projection of embedding vectors for ALMA continuum images selected from the ASA. Each point is represented by a thumbnail of the corresponding FITS image. For clarity, low-emission regions have been masked using an adaptive threshold derived from a Gaussian fit to the core of the distribution of normalized pixel values. Sources with stronger emission appear on top of weaker ones in the plot to enhance visibility.
  • Figure 3: ALMA continuum image of GG Tau (upper left) and the six most morphologically similar images, identified using cosine similarity. Each thumbnail shows the flux density in Jy beam$^{-1}$. The lower right panel presents a 2D PCA projection of the embedding vectors, with the selected object and its closest matches highlighted.
  • Figure 4: Same as Fig. 2 but for cloud data obtained with Spitzer.
  • Figure 5: Principal component analysis of cloud morphologies. Upper panel: PC1 vs. PC2, where PC1 primarily seems to trace cloud size and PC2 reflects the distribution of the emission. Lower panel: PC2 vs. PC3, where PC2 again represents the emission distribution, while PC3 seems to capture some measure of morphological complexity.
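
The similarity search described in the Figure 3 caption, which ranks images by the cosine similarity of their embedding vectors, could be sketched roughly as follows. This is an illustrative, pure-Python example and not the package's own routine: the helper names `cosine_similarity` and `most_similar`, the dictionary layout, and the toy three-dimensional embeddings are assumptions. With real embeddings one would more likely vectorize this with NumPy or PyTorch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(embeddings, query_name, k=6):
    """Return the k entries most similar to the query, best match first.

    `embeddings` maps an image name to its embedding vector (hypothetical
    layout chosen for this sketch). The query itself is excluded.
    """
    query = embeddings[query_name]
    scores = [
        (name, cosine_similarity(query, vector))
        for name, vector in embeddings.items()
        if name != query_name
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Toy embeddings standing in for the output of an inference run.
toy_embeddings = {
    "query": [1.0, 0.0, 0.0],
    "near": [0.9, 0.1, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

for name, score in most_similar(toy_embeddings, "query", k=2):
    print(f"{name}: {score:.3f}")
```

Cosine similarity compares only the direction of the vectors, so two images can rank as close matches even when the norms of their embeddings differ.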