From Simulations to Surveys: Domain Adaptation for Galaxy Observations

Kaley Brauer; Aditya Prasad Dash; Meet J. Vyas; Ahmed Salim; Stiven Briand Massala

TL;DR

The paper tackles the challenge of transferring galaxy morphology inferences from simulated images to real survey data by framing it as a covariate-shift problem and proposing a domain-adaptation pipeline trained on TNG50 SKIRT simulations and evaluated on SDSS Galaxy Zoo labels. It combines three backbones (CNN, $E(2)$-steerable CNN, and ResNet-18), a focal loss with effective-number class weights, and a suite of domain-alignment losses from GeomLoss, including entropic Sinkhorn OT, energy distance, and Gaussian MMD, augmented with a top-$k$ soft matching term to focus on worst-aligned pairs. The results show substantial gains on the target domain, with the best Euclidean-distance approach achieving $87.3\%$ accuracy and a macro-F1 of $0.626$, compared to a baseline of $46.8\%$ accuracy and $0.298$ macro-F1, and a domain AUC near $0.5$ indicating effective latent-space mixing. This work demonstrates a viable path toward robust sim-to-real domain adaptation for large galaxy surveys and outlines concrete steps for expanding labels, refining metrics, and testing alternative architectures to enhance cross-domain alignment.
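As a rough illustration of why a domain AUC near $0.5$ signals latent-space mixing, the sketch below computes the AUC from domain-classifier scores using the Mann-Whitney rank statistic. The function name `domain_auc` and the toy scores are assumptions made for illustration; the source does not say how the domain classifier was trained or scored.

```python
def domain_auc(source_scores, target_scores):
    """AUC of a domain classifier via the Mann-Whitney rank statistic.

    Each score is the classifier's predicted probability of "target domain".
    Returns P(target score > source score), counting ties as 1/2.
    """
    wins = 0.0
    for t in target_scores:
        for s in source_scores:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5
    return wins / (len(source_scores) * len(target_scores))


# Toy scores (hypothetical, not from the paper).
# Separable domains: every target score exceeds every source score.
separable = domain_auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])

# Indistinguishable domains: both domains produce the same score distribution.
mixed = domain_auc([0.2, 0.4, 0.6, 0.8], [0.2, 0.4, 0.6, 0.8])
```

An AUC of 1 means the classifier can always tell simulated from real features, while a value of 0.5 means it does no better than chance, which is the regime the adapted models are reported to reach.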

Abstract

Large photometric surveys will image billions of galaxies, but we currently lack fast, reliable automated methods for inferring physical properties such as morphology, stellar mass, and star formation rate. Simulations provide galaxy images with ground-truth physical labels, but domain shifts in PSF, noise, backgrounds, selection, and label priors degrade transfer to real surveys. We present a preliminary domain adaptation pipeline that trains on simulated TNG50 galaxies and evaluates on real SDSS galaxies with morphology labels (elliptical/spiral/irregular). We train three backbones (CNN, $E(2)$-steerable CNN, ResNet-18) with focal loss and effective-number class weighting, together with a feature-level domain loss $L_D$ built from GeomLoss (entropic Sinkhorn OT, energy distance, Gaussian MMD, and related metrics). We show that combining these losses with an OT-based "top-$k$ soft matching" loss, which focuses $L_D$ on the worst-matched source-target pairs, can further enhance domain alignment. With Euclidean distance, scheduled alignment weights, and top-$k$ matching, target accuracy (macro F1) rises from $\sim$46% ($\sim$30%) without adaptation to $\sim$87% ($\sim$62.6%), with a domain AUC near 0.5, indicating strong latent-space mixing.
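To make the classification loss concrete, here is a minimal, framework-free sketch of a focal loss with effective-number class weights. It assumes the standard formulations: effective number $E_n = (1-\beta^n)/(1-\beta)$ with weights proportional to $1/E_n$, and focal loss $-w_y (1-p_y)^\gamma \log p_y$. The values of `beta`, `gamma`, the class counts, and all function names are illustrative assumptions, not settings reported in the source.

```python
import math


def effective_number_weights(class_counts, beta=0.999):
    """Class weights proportional to 1 / E_n, with E_n = (1 - beta^n) / (1 - beta).

    Weights are normalized so that they sum to the number of classes.
    Every count must be positive.
    """
    inv_eff = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(class_counts) / sum(inv_eff)
    return [w * scale for w in inv_eff]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, label, class_weights, gamma=2.0):
    """Weighted focal loss for one sample: -w_y * (1 - p_y)^gamma * log(p_y)."""
    p_true = softmax(logits)[label]
    return -class_weights[label] * (1.0 - p_true) ** gamma * math.log(p_true)


# Hypothetical counts for (elliptical, spiral, irregular); not from the paper.
counts = [5000, 4000, 200]
weights = effective_number_weights(counts)

# A confident correct prediction is down-weighted relative to an uncertain one.
confident = focal_loss([4.0, 0.0, 0.0], label=0, class_weights=weights)
uncertain = focal_loss([0.5, 0.0, 0.0], label=0, class_weights=weights)
```

With `gamma=0` and unit weights this reduces to ordinary cross-entropy; increasing `gamma` shifts the loss toward hard examples, while the effective-number weights give the rare class a larger weight than the common ones.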
