Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li

TL;DR

GalaxySD is a conditional diffusion framework trained on Galaxy Zoo 2 to synthesize morphology-conditioned galaxy images for data augmentation. By leveraging cross-attention and feature-weighted prompts, it can extrapolate to rare or unseen morphologies, such as dusty early-type galaxies, while maintaining realism and diversity. The synthetic data yield tangible improvements in classical morphology classification (up to 30% in purity and completeness) and enable the discovery of 520 additional dusty early-type galaxies, raising the count from 352 to 872. This work demonstrates the practical value of generative models for data augmentation and exploratory science in large astronomical surveys and lays groundwork for future astrophysical foundation models.

Abstract

Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle to generalize in large-scale surveys because labeled datasets, whether from simulations or human annotation, are of limited representativeness, a challenge that is especially pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model, GalaxySD, that synthesizes realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 (GZ2) dataset, which contains pairs of visual features and galaxy images from volunteer annotation, we demonstrate that GalaxySD generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, the model enables generative extrapolation, projecting well-annotated data into unseen domains and advancing rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features (~0.1% of the GZ2 dataset) as a test case, our approach more than doubled the number of detected instances, from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge the gap between scarce labeled data and the vast, uncharted parameter space of observational astronomy, and offers insight for future astrophysical foundation model development. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.
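
The abstract reports gains in purity and completeness. As a minimal sketch, assuming these follow the usual definitions (purity as precision, TP / (TP + FP); completeness as recall, TP / (TP + FN)), they can be computed for a binary classifier as below. The function name and the toy labels are hypothetical and are not taken from the paper; only the counts 352 and 872 come from the abstract.

```python
def purity_completeness(y_true, y_pred):
    """Return (purity, completeness) for binary labels, where 1 marks the target class.

    Assumed definitions (not confirmed by the source):
      purity       = TP / (TP + FP)  (precision)
      completeness = TP / (TP + FN)  (recall)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    purity = tp / (tp + fp) if (tp + fp) else 0.0
    completeness = tp / (tp + fn) if (tp + fn) else 0.0
    return purity, completeness


# Toy labels, made up purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
purity, completeness = purity_completeness(y_true, y_pred)

# Counts quoted in the abstract: 352 previously known instances, 872 after the search.
previous_count, new_count = 352, 872
additional = new_count - previous_count    # newly identified instances
growth_factor = new_count / previous_count  # ratio of new to previous count
```

With the quoted counts, `additional` is 520 and `growth_factor` is about 2.5, which is where the "520 additional" figure in the TL;DR comes from.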

Paper Structure

This paper contains 23 sections, 13 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Schematic diagram of GalaxySD, evaluation of synthetic images, and applications to downstream tasks. In the training process shown in the upper panel, the training set consists of real galaxy images and descriptive morphological labels (refer to Figure \ref{fig:sdss_distribution}) from Galaxy Zoo 2, which are used to fine-tune (abbreviated as FT) the diffusion model. In the diffusion model, ${x_0}$ denotes the original galaxy image, ${x_T}$ the noised image sampled from a normal distribution, and ${x_0}'$ the reconstructed galaxy image predicted by the neural network conditioned on ${x_T}$ and the input text. For inference, given morphological prompts (refer to Table \ref{tab:prompts}), our fine-tuned model, GalaxySD, generates realistic and diverse galaxy images on demand. These generated images are evaluated with three metrics (realism, diversity, and consistency), as described in Section \ref{subsec:visual}. In practical applications, these simulated galaxy images, combined with limited real data, can form a high-quality augmented dataset for training a binary classifier, which achieves better performance (denoted by an upward arrow) than other ML methods and finds new galaxies in Galaxy Zoo SDSS, as detailed in Sections \ref{subsec:classical} and \ref{subsec:few-shot}.
  • Figure 2: UpSet plot visualizing the morphology distribution in the training samples. The gray dots and connections represent the co-occurrence of different morphological features in the dataset. Bars indicate the counts of individual features and their combinations. Only primary tag categories are shown; detailed tags such as bulge prominence and the number of spiral arms are excluded. This highly imbalanced distribution emphasizes the necessity of incorporating synthetic data for robust machine learning training.
  • Figure 3: Galaxy images generated by GalaxySD under various morphology-related text prompts, compared with real galaxy images. In each column, labeled by a capital letter in the top-right corner, the first image is a real one from SDSS and the following three images are synthetic. The capital letters denote the prompts used to generate the simulated galaxies, as listed in Table \ref{tab:prompts}.
  • Figure 4: Quantified consistency, diversity, and realism of generated galaxy images as a function of training steps, under average prompts. The benchmark line in the step vs. consistency panel is $y=0.895$, which denotes the consistency of real galaxy images under average prompts. The lower panel illustrates that, under various prompts, the generated galaxy images become more realistic as training proceeds.
  • Figure 5: Pareto fronts showing the trade-off between the realism and diversity evaluation indicators of our conditional diffusion model. The dashed lines are Pareto fronts representing the frontiers of optimal solutions to the three evaluation metrics. Different colors represent different types of morphological prompts, as shown in the top legend.
  • ...and 12 more figures
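
The Pareto fronts mentioned in the Figure 5 caption can be illustrated with a small sketch. It assumes, purely for illustration, that each model checkpoint is summarized by a (realism, diversity) pair and that larger is better on both axes. The paper's actual metric definitions and orientation are not given in this summary, and the function name and scores below are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (realism, diversity) pairs, sorted by realism.

    Assumption: larger is better on both axes. A point is dominated if another
    point is at least as good on both axes and strictly better on at least one.
    """
    front = []
    for i, (r_i, d_i) in enumerate(points):
        dominated = any(
            r_j >= r_i and d_j >= d_i and (r_j > r_i or d_j > d_i)
            for j, (r_j, d_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((r_i, d_i))
    return sorted(front)


# Hypothetical (realism, diversity) scores, not taken from the paper.
scores = [
    (0.90, 0.20),
    (0.80, 0.50),
    (0.60, 0.70),
    (0.70, 0.40),  # dominated by (0.80, 0.50)
    (0.50, 0.60),  # dominated by (0.60, 0.70)
    (0.30, 0.90),
]
front = pareto_front(scores)
```

Here `front` keeps the four points for which realism cannot be improved without giving up diversity, which is the kind of trade-off a Pareto-front plot traces out. If a metric is oriented so that lower is better, its sign would need to be flipped before calling the function.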