Mapping Synthetic Observations to Prestellar Core Models: An Interpretable Machine Learning Approach

T. Grassi, M. Padovani, D. Galli, N. Vaytet, S. S. Jensen, E. Redaelli, S. Spezzano, S. Bovino, P. Caselli

TL;DR

This study builds a pipeline to map synthetic prestellar-core spectra to underlying physical properties by combining a 1D isothermal collapse model, thermochemical evolution, LOC radiative transfer, and SHAP-based interpretability. The authors demonstrate that most physical parameters are recoverable from spectra, notably constraining the cosmic-ray ionization rate and its radial profile via lines from species such as N$_2$H$^+$, N$_2$D$^+$, and DCO$^+$, while a few quantities like the total mass and velocity dispersion are harder to pin down. The backward emulation framework, paired with SHAP, enables rapid, interpretable inference and identification of spectral features that drive parameter predictions, offering a method to quantify information loss in observations. The work provides a flexible, generalizable approach for linking spectral data to core properties and highlights limitations related to geometry, chemistry, and emulator applicability that future improvements can address.

Abstract

Observations of molecular lines are a key tool to determine the main physical properties of prestellar cores. However, not all the information is retained in the observational process or easily interpretable, especially when a larger number of physical properties and spectral features are involved. We present a methodology to link the information in the synthetic spectra with the actual information in the simulated models (i.e., their physical properties), in particular, to determine where the information resides in the spectra. We employ a 1D gravitational collapse model with advanced thermochemistry, from which we generate synthetic spectra. We then use neural network emulations and the SHapley Additive exPlanations (SHAP), a machine learning technique, to connect the models' properties to the specific spectral features. Thanks to interpretable machine learning, we find several correlations between synthetic lines and some of the key model parameters, such as the cosmic-ray ionization radial profile, the central density, or the abundance of various species, suggesting that most of the information is retained in the observational process. Our procedure can be generalized to similar scenarios to quantify the amount of information lost in the real observations. We also point out the limitations for future applicability.
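To illustrate the idea behind SHAP, the sketch below computes exact Shapley values for a toy function by brute force. It is not the authors' implementation: `exact_shapley`, `toy_model`, the three-channel input, and the zero baseline are hypothetical names and choices made only for this example. In practice the SHAP library relies on approximations because the exact sum grows as 2^n with the number of input features.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features absent from a coalition are set to their baseline value.
    The cost grows as 2**n, so this is only usable for tiny toy inputs.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size (excluding feature i).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                idx = np.array(subset, dtype=int)
                without_i = baseline.copy()
                without_i[idx] = x[idx]
                with_i = without_i.copy()
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


# Hypothetical "backward" map: three spectral channels -> one parameter.
def toy_model(channels):
    return 2.0 * channels[0] + channels[1] * channels[2]


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = exact_shapley(toy_model, x, baseline)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the product term is shared equally between the two channels that enter it. This additivity is the property that allows a parameter prediction to be decomposed into contributions from individual spectral features.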

Paper Structure

This paper contains 18 sections, 3 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: The procedure employed in this work. The top part represents the modeling steps: (1) generate the library of gravitational collapse models, (2) randomly select the base models depending on the global parameters, (3) evolve the thermochemistry in the collapse model, (4) include additional chemistry with post-processing, (5) obtain the derived quantities, for example, the abundances of some key species, and (6) produce the synthetic spectra. The middle part is the emulation: (7a) is the forward emulation, from parameters to spectra, while (7) is the backward emulation, from spectra to parameters. The bottom part is the sensitivity analysis using SHAP (8), where we "perturb" the neural network input features to determine their impact on the outputs.
  • Figure 2: Example model from the population synthesis set at $t_{\rm max}$. The $x$-axis of each panel spans approximately 0.3 pc, or 350$''$ assuming a cloud at a distance of 170 pc from the observer. For clarity, this plot shows only an inner region of the actual computational domain. Upper left panel: total number density radial profile (blue, leftmost $y$-scale) and cosmic-ray ionization rate $\zeta$ (orange, rightmost $y$-scale). Upper right: gas (orange) and dust (blue) temperature radial profiles (both on leftmost $y$-scale), and radial velocity profile (green, rightmost $y$-scale). Lower left: cooling and heating contributions, where $\Lambda_{\rm d}$ is the dust cooling, $\Lambda_{\ce{CO}}$ is the CO cooling, $\Lambda_{\rm Z}$ is the cooling from atomic species (C, C$^+$, and O), $\Gamma_{\rm ad}$ is the adiabatic heating (i.e., compressional heating), $\Gamma_{\rm CR}$ is the cosmic-ray heating, and $\Gamma_{\rm phe}$ is the dust photoelectric heating. The cooling and heating contributions below $10^{-26}$ erg s$^{-1}$ cm$^{-3}$ are not reported in the legend. Lower right: the radial profiles of a subset of the chemical species computed with the 4406-reaction network. The fractional abundance is relative to the total number density. CO$_{\rm d}$ (dashed green) is the CO on the dust surface. The comparison with Sipila2018 is reported in Appendix \ref{sect:comparison}.
  • Figure 3: Correlation matrix between models' parameters. The panels in the upper left triangle show the Pearson correlation coefficient color-coded, as indicated by the color bar. The plots in the lower right triangle show the 2D correlation histograms of the parameters of the 3000 generated models. The panels in the diagonal represent the histogram of the parameter distribution. The correlation for quantities marked with "*" is computed using their logarithm. Note that the color bar is clipped to 0.2 to enhance the stronger correlations.
  • Figure 4: Example spectra for the model in Fig. \ref{fig:model}. Each panel reports the spectra of the molecule and the transitions indicated in the legend. The spectra are convolved with a telescope beam, as discussed in the main text. Note that the $y$-axis is scaled to the value at the top of the panel, e.g., the C$^{18}$O temperature is scaled to $10^{-1}$. All the spectra are calculated with a $-2$ to $+2$ km s$^{-1}$ bandwidth range and 128 channels, except for N$_2$H$^+$, N$_2$D$^+$, and HCN, where the bandwidth is calculated by LOC to take into account the hyperfine structure of these molecules, but interpolated to use 128 channels. We also use the spectra centered according to the LOC output (i.e., the position of 0 km s$^{-1}$).
  • Figure 5: Training (blue) and validation (orange) total loss at different epochs for the backward emulation training. Note that although the total loss does not become constant, the individual losses of the physical parameters (not reported here) become constant after a few epochs, apart from those of the turbulence velocity dispersion ($\sigma_\varv$) and the total mass ($M$), which drive the total loss.
  • ...and 10 more figures
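
Two numerical details in the captions above can be checked with a short sketch: the angular size quoted for Figure 2, and the interpolation of a wide-band hyperfine spectrum onto 128 channels mentioned for Figure 4. The function names, the 300-channel input grid, and the Gaussian test line are assumptions made for illustration. The captions do not state which interpolation scheme was used, so linear interpolation is only a stand-in.

```python
import numpy as np

AU_PER_PC = 206264.806  # astronomical units in one parsec


def angular_size_arcsec(size_pc, distance_pc):
    """Small-angle size in arcsec: theta [arcsec] = size [au] / distance [pc]."""
    return size_pc * AU_PER_PC / distance_pc


def resample_spectrum(velocity, intensity, n_channels=128):
    """Linearly interpolate a spectrum onto n_channels evenly spaced velocities."""
    new_velocity = np.linspace(velocity[0], velocity[-1], n_channels)
    return new_velocity, np.interp(new_velocity, velocity, intensity)


# 0.3 pc seen from 170 pc, to compare with the "approximately 350 arcsec" quoted.
theta = angular_size_arcsec(0.3, 170.0)

# Hypothetical wide-band line on 300 native channels (made-up Gaussian profile).
v = np.linspace(-8.0, 8.0, 300)
t = np.exp(-0.5 * (v / 0.3) ** 2)
v128, t128 = resample_spectrum(v, t)
```

Resampling every line to the same number of channels gives the neural network a fixed-length input regardless of the native bandwidth of each transition.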