From Simulations to Surveys: Domain Adaptation for Galaxy Observations
Kaley Brauer, Aditya Prasad Dash, Meet J. Vyas, Ahmed Salim, Stiven Briand Massala
TL;DR
The paper addresses the transfer of galaxy morphology inference from simulated images to real survey data. It frames the task as a covariate-shift problem and proposes a domain-adaptation pipeline trained on TNG50 SKIRT simulations and evaluated against SDSS Galaxy Zoo labels. The pipeline combines three backbones (CNN, $E(2)$-steerable CNN, and ResNet-18), a focal loss with effective-number class weights, and a suite of domain-alignment losses from GeomLoss (entropic Sinkhorn OT, energy distance, and Gaussian MMD), augmented with a top-$k$ soft matching term that focuses on the worst-aligned source-target pairs. On the target domain, the best Euclidean-distance configuration reaches $87.3\%$ accuracy and a macro-F1 of $0.626$, compared with $46.8\%$ accuracy and $0.298$ macro-F1 for the unadapted baseline, and a domain AUC near $0.5$ indicates effective latent-space mixing. The work demonstrates a viable path toward robust sim-to-real domain adaptation for large galaxy surveys and outlines concrete next steps: expanding labels, refining metrics, and testing alternative architectures to improve cross-domain alignment.
Abstract
Large photometric surveys will image billions of galaxies, but we currently lack fast, reliable, automated ways to infer physical properties such as morphology, stellar mass, and star formation rate. Simulations provide galaxy images with ground-truth physical labels, but domain shifts in PSF, noise, backgrounds, selection, and label priors degrade transfer to real surveys. We present a preliminary domain adaptation pipeline that trains on simulated TNG50 galaxies and evaluates on real SDSS galaxies with morphology labels (elliptical/spiral/irregular). We train three backbones (CNN, $E(2)$-steerable CNN, ResNet-18) with focal loss and effective-number class weighting, together with a feature-level domain loss $L_D$ built from GeomLoss (entropic Sinkhorn OT, energy distance, Gaussian MMD, and related metrics). We show that combining these losses with an OT-based "top-$k$ soft matching" loss, which focuses $L_D$ on the worst-matched source-target pairs, can further enhance domain alignment. With Euclidean distance, scheduled alignment weights, and top-$k$ matching, target accuracy (macro F1) rises from $\sim$46% ($\sim$30%) without adaptation to $\sim$87% ($\sim$62.6%), with a domain AUC near 0.5, indicating strong latent-space mixing.
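To make the classification objective named above concrete, the following is a minimal sketch of a focal loss with effective-number class weights, written in plain Python for a single example. It is an illustration under stated assumptions, not the authors' implementation: the function names, the values `beta=0.999` and `gamma=2.0`, the normalization of the weights to sum to the number of classes, and the class counts are all hypothetical choices that do not come from the paper.

```python
import math

def effective_number_weights(class_counts, beta=0.999):
    """Weights proportional to 1 / E_n, where E_n = (1 - beta**n) / (1 - beta).

    beta is an assumed hyperparameter; every count must be at least 1.
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    # Assumed convention: rescale so the weights sum to the number of classes.
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, label, class_weights, gamma=2.0):
    """Focal loss for one example: -w_y * (1 - p_y)**gamma * log(p_y)."""
    p_t = softmax(logits)[label]
    return -class_weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical counts for (elliptical, spiral, irregular); not from the paper.
counts = [5000, 4000, 200]
weights = effective_number_weights(counts)

# A confidently correct prediction versus a confidently wrong one, both with true class 0.
easy = focal_loss([4.0, 0.0, 0.0], label=0, class_weights=weights)
hard = focal_loss([0.0, 0.0, 4.0], label=0, class_weights=weights)
```

In this sketch the rarest class receives the largest weight, and the factor $(1 - p_y)^{\gamma}$ shrinks the contribution of examples the model already classifies confidently, so `hard` is much larger than `easy`. Setting `gamma=0` with unit weights recovers the ordinary cross-entropy.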
