Identification of molecular line emission using Convolutional Neural Networks

Nina Kessler, Timea Csengeri, David Cornu, Sylvain Bontemps, Laure Bouscasse

TL;DR

This paper tackles the problem of identifying molecular line emission from complex organic molecules in line-rich millimeter spectra. It introduces a convolutional neural network trained on LTE-synthesized spectra spanning 20 molecules in the 3 mm band ($80$--$115$ GHz) to output detection probabilities for multiple species simultaneously. The authors demonstrate robust performance on synthetic data, calibrate model scores to probabilistic detections, and explore resilience to noise, line density, and incomplete frequency coverage, including transfer learning to real observational setups. Application to archival IRAM data shows the method's potential to rapidly infer molecular inventories, while acknowledging limitations due to real-world spectral complexity and the need for expanded training sets and transfer learning.

Abstract

Complex organic molecules (COMs) are abundant in various astrophysical environments; in star-forming regions in particular, they are observed toward both protostellar envelopes and shocked regions. The emission spectrum of COMs, especially the heavier ones, may consist of up to hundreds of lines, and line blending hinders the analysis. Nevertheless, identifying the molecular composition of the gas that produces the observed millimeter spectra is the first step toward a quantitative analysis. We develop a new method based on supervised machine learning that recognizes spectroscopic features of the rotational spectra of molecules in the 3 mm atmospheric transmission band, for a list of species including COMs, with the aim of obtaining a detection probability. We use local thermodynamic equilibrium (LTE) modeling to build a large set of synthetic spectra of 20 molecular species, including COMs, over a range of physical conditions typical of star-forming regions. We design and train a convolutional neural network (CNN) that provides detection probabilities for individual species in the spectra. We demonstrate that the resulting CNN model robustly detects the spectroscopic signatures of these species in synthetic spectra. We evaluate its ability to detect molecules as a function of noise level, frequency coverage, and line richness, and also test its performance with incomplete frequency coverage, finding high detection probabilities across the tested parameter space and no false predictions. Finally, we apply the CNN model to observational data from the literature toward line-rich, hot-core-like sources, where detection probabilities remain reasonable with no false detections. We show that CNNs can facilitate the analysis of complex millimeter spectra, both on synthetic spectra and in first tests on observational data.
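The idea of one detection probability per species can be illustrated with a minimal sketch. It assumes that each output is an independent sigmoid score trained with binary cross-entropy, which is a common choice for multi-label classification but is not stated in the abstract. The species names, logits, and the 0.5 threshold are hypothetical placeholders, not values from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function mapping a real logit to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def multilabel_scores(logits):
    """One independent score per species; the scores need not sum to 1."""
    return [sigmoid(z) for z in logits]

def binary_cross_entropy(scores, targets, eps=1e-12):
    """Mean binary cross-entropy over all species (assumed training loss)."""
    total = 0.0
    for p, t in zip(scores, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(scores)

def detected(species, scores, threshold=0.5):
    """Names of the species whose score reaches the (hypothetical) threshold."""
    return [name for name, p in zip(species, scores) if p >= threshold]

# Hypothetical example: three placeholder species and made-up logits.
species = ["species_A", "species_B", "species_C"]
logits = [2.0, -1.0, 0.0]
scores = multilabel_scores(logits)
print(detected(species, scores))
```

Because each score is computed independently, several species can be flagged in the same spectrum, which a softmax output (where scores compete and sum to 1) would not allow.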

Paper Structure

This paper contains 35 sections, 3 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Synthetic spectrum of a classical hot core computed from Table \ref{tab:parameters_classical_hot_core} with a 5 km s$^{-1}$ line width and $50$ mK Gaussian noise, with a zoom on 400 MHz. The LTE models are in colors. The mask computed with the molecules from Table \ref{tab:small_molecules} is in gray. The fake lines and absorption features are in blue and red respectively.
  • Figure 2: Scheme of the CNN architecture. The input data is an example of a composite spectrum under the three normalizations, i.e., by the maximum (top), hyperbolic tangent (center), and polynomial (bottom). Filters are applied to the data to convolve the information and produce feature maps. This operation is done for each of the convolutional layers. Dense layers then combine the extracted features and learn how to label the spectra depending on the provided target. The output layer is composed of one neuron per class, each giving a score between 0 and 1 that is independent of the others.
  • Figure 3: Loss function computed on the validation dataset and AUC values as a function of training iterations with specified minimum loss and the corresponding AUC.
  • Figure 4: ROC curves of the molecules for which the CNN learned to detect their spectral signature. The values were computed over the [0, 1] range of the $x$- and $y$-axes.
  • Figure 5: AUC as a function of molecules for three trainings of the multi-labeling CNN. The squares, triangles and stars are the AUC values. The red crosses correspond to the number of spectra where the molecules are detected.
  • ...and 8 more figures
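The figure captions above report ROC curves and AUC values per molecule. As a rough illustration of what such a number measures, the sketch below computes the AUC for one species from its rank interpretation: the probability that a spectrum containing the molecule receives a higher score than one that does not, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and the paper's own evaluation code may differ.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 if the species is present in the spectrum, 0 otherwise.
    scores: model scores in [0, 1] for the same spectra.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical toy data for a single species across four spectra.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))  # 3 of 4 positive/negative pairs are ordered correctly -> 0.75
```

An AUC of 1.0 means every spectrum containing the molecule scores above every spectrum without it, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of spectra, so a sort-based implementation would be preferable for large validation sets.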