The Importance of Being Adaptable: An Exploration of the Power and Limitations of Domain Adaptation for Simulation-Based Inference with Galaxy Clusters

Michelle Ntampaka, A. Ciprijanovic, Ana Maria Delgado, John Soltis, John F. Wu, Mikaeel Yunus, John ZuHone

TL;DR

This work addresses domain shift in simulation-based inference for galaxy-cluster masses by constructing a Training Set from the Magneticum simulation, a scatter-augmented variant that captures uncertainties in scaling relations, and a distinct IllustrisTNG Test Set with realistic X-ray mocks. It compares three deep-learning strategies (a standard NN, a Scatter-Augmented NN (SANN), and a semi-supervised Deep Reconstruction-Regression Network (DRRN)) against the $M$-$M_{\text{gas}}$, $M$-$T$, and $M$-$Y_X$ scaling relations. The NN improves results by about 17% on the Training Data but performs roughly 40% worse on the out-of-domain Test Set; the SANN degrades similarly; the DRRN maps the Training and Test Data onto a shared latent space yet underperforms a straightforward $Y_X$ proxy, highlighting persistent biases from domain shift. The findings underscore the fragility of simulation-based inference under subtle domain differences and the need for careful calibration and robust domain-aware approaches before applying models to real observational data.

Abstract

The application of deep machine learning methods in astronomy has exploded in the last decade, with new models showing remarkably improved performance on benchmark tasks. Not nearly enough attention is given to understanding the models' robustness, especially when the test data are systematically different from the training data, or "out of domain." Domain shift poses a significant challenge for simulation-based inference, where models are trained on simulated data but applied to real observational data. In this paper, we explore domain shift and test domain adaptation methods for a specific scientific case: simulation-based inference for estimating galaxy cluster masses from X-ray profiles. We build datasets to mimic simulation-based inference: a Training Set from the Magneticum simulation, a scatter-augmented Training Set to capture uncertainties in scaling relations, and a Test Set derived from the IllustrisTNG simulation. We demonstrate that the Test Set is out of domain in subtle ways that would be difficult to detect without careful analysis. We apply three deep learning methods: a standard neural network (NN), a Scatter-Augmented Neural Network (SANN) trained on the scatter-augmented input catalogs, and a Deep Reconstruction-Regression Network (DRRN), a semi-supervised deep model engineered to address domain shift. Although the NN improves results by 17% on the Training Data, it performs 40% worse on the out-of-domain Test Set. Surprisingly, the SANN performs similarly. While the DRRN succeeds in mapping the Training and Test Data onto the same latent space, it consistently underperforms a straightforward $Y_X$ scaling relation. These results serve as a warning that simulation-based inference must be handled with extreme care, as subtle differences between training simulations and observational data can lead to unforeseen biases creeping into the results.

Paper Structure

This paper contains 9 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Sample clusters from the Training Set (top row) and Test Set (bottom row) at three masses ($\log\left[M/(h^{-1}M_\odot)\right]\approx 14.00, 14.50, 14.80$). The Training Set is derived from the Magneticum simulation at z=0.07, as observed for 100ks by a generic instrument with a flat instrument response. In contrast, the Test Set is derived from IllustrisTNG300 at z=0.05 with a simulated 100ks Chandra observation. Circles denote 0.15$R_\mathrm{500}$ (dotted), 0.50$R_\mathrm{500}$ (dashed), and 1.0$R_\mathrm{500}$ (solid). The differences between the Training and Test Sets are described in Tables \ref{table:sim_compare} and \ref{table:obs_compare}. Figures \ref{fig:sully}, \ref{fig:powerlaw}, and \ref{fig:yx2} illustrate further differences between these two data sets. The Training and Test Sets are visibly different, and were constructed this way by design: because ML models can unfairly use simulation artifacts to infer underlying parameters, we built an entirely separate Test Set to evaluate the robustness of our model to domain shift.
  • Figure 2: Top: Representative sample of density profiles for the Training (left) and Test (right) Sets. Profiles are colored by cluster mass. The Magneticum profiles that form the Training Set are shallower in the core and less self-similar in the outskirts, an effect that is likely due to feedback differences between Magneticum and IllustrisTNG. Bottom: Correlation matrices showing how these same density profiles are correlated across logarithmically spaced radial bins. The Magneticum profiles that comprise the Training Set have stronger positive correlations across disparate radial bins. The IllustrisTNG clusters that comprise the Test Set, however, have very strong correlations among bins inside of $r\sim0.67R_{500}$, while the outskirts of these clusters are anticorrelated with the inner regions.
  • Figure 3: Scatter associated with power law mass predictions for the $T-M_{500}$ (top), $M_\mathrm{gas}-M_{500}$ (middle), and $Y_X-M_{500}$ (bottom) scaling relations. The Training Set (orange with 1- and 2-$\sigma$ bands) and Test Set (purple points with 1-$\sigma$ error bars) do not follow the same scaling relations, nor do they have the same intrinsic scatter. The best-fit power law parameters for each simulation are compiled in Table \ref{table:obs_compare}. When Training Set scaling parameters are adopted, scaling relation predictions on the Test Set have a mass-dependent bias. In practice, this is accounted for by employing an unbiased proxy (such as weak lensing mass estimates) to calibrate the underlying bias.
  • Figure 4: The errors in gas-mass-based and temperature-based mass estimates of clusters in the Training and Test Sets. The left and right panels show the same information in two ways: scatter points on the left, and ellipses enclosing the inner 1- and 2-$\sigma$ of the same data on the right. In the Training Set (orange), the errors on mass estimates derived from gas mass ($\Delta\,M_\mathrm{gas}$) and those derived from temperature ($\Delta\,T$) are anticorrelated, similar to the trend shown in Figure 4 of 2006ApJ...650..128K. This anticorrelation is the primary underlying trend that $Y_X$ exploits to produce a lower-scatter mass proxy. In the Test Set (purple), however, these errors are positively correlated, so using $Y_X$ yields a smaller net reduction in mass scatter.
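
Two of the effects described in the captions for Figures 3 and 4 can be illustrated with a short numerical sketch. This is not the authors' code, and every number in it is a made-up illustrative value, not a measurement from Magneticum or IllustrisTNG. The first part fits a power law in log space and shows that applying one data set's best-fit parameters to data with a different slope produces a bias that changes with mass. The second part assumes self-similar slopes ($M_\mathrm{gas}\propto M$, $T\propto M^{2/3}$, hence $Y_X\propto M^{5/3}$), under which the log-mass error of the $Y_X$ estimate is $0.6\,\Delta M_\mathrm{gas} + 0.4\,\Delta T$. The function names, the slopes, and the 0.6/0.4 weights are assumptions introduced here for illustration.

```python
import numpy as np


def fit_power_law(x, mass):
    """Least-squares fit of log10(M) = a + b * log10(X). Returns (a, b)."""
    b, a = np.polyfit(np.log10(x), np.log10(mass), 1)
    return a, b


def predict_log_mass(x, a, b):
    """Predicted log10(M) for observable X, given power-law parameters."""
    return a + b * np.log10(x)


def combined_scatter(sigma_gas, sigma_temp, rho, w_gas=0.6, w_temp=0.4):
    """Scatter (dex) of w_gas * dMgas + w_temp * dT for errors with
    standard deviations sigma_gas, sigma_temp and correlation rho.
    The default weights assume self-similar slopes (an assumption)."""
    var = ((w_gas * sigma_gas) ** 2
           + (w_temp * sigma_temp) ** 2
           + 2.0 * w_gas * w_temp * rho * sigma_gas * sigma_temp)
    return float(np.sqrt(max(var, 0.0)))  # guard against tiny negative round-off


# --- Part 1: mass-dependent bias from mismatched scaling relations ---
# Hypothetical, noise-free toy relations; the slopes are NOT from the paper.
temperature = np.logspace(0.0, 1.0, 20)                    # 1 to 10 (arbitrary keV-like units)
train_mass = 10 ** (13.5 + 1.5 * np.log10(temperature))    # toy "training" relation
test_log_mass = 13.5 + 1.7 * np.log10(temperature)         # toy "test" relation, steeper slope

a_train, b_train = fit_power_law(temperature, train_mass)
bias = predict_log_mass(temperature, a_train, b_train) - test_log_mass
print("bias at low / high mass end (dex):", bias[0], bias[-1])

# --- Part 2: why the sign of the error correlation matters for Y_X ---
for rho in (-0.5, 0.0, 0.5):
    print("rho =", rho, "-> combined scatter (dex):", combined_scatter(0.1, 0.1, rho))
```

In the toy setup, the bias is zero at the low-mass end and grows in magnitude toward high mass, because the two relations differ in slope rather than only in normalization. For equal 0.1 dex errors, the combined scatter shrinks as the correlation becomes more negative, which is the qualitative behavior the Figure 4 caption attributes to the Training Set, while a positive correlation (as in the Test Set) leaves less room for improvement.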