Identifying Evolutionary Stages of Molecular Clumps through Unsupervised and Supervised Machine Learning

K. V. Plakitina, M. S. Kirsanova, A. B. Ostrovskii, A. D. Gimalieva, S. V. Salii, A. V. Meshcheryakov

TL;DR

This work demonstrates the capability of machine learning as a complementary, data-driven approach to automate the identification and classification of molecular clumps using data from the MALT90 survey, complemented by Spitzer IR photometry.

Abstract

The evolutionary classification of molecular clumps, crucial for understanding star formation, is commonly based on human-assigned categories derived from infrared (IR) emission and well-established morphological criteria. However, due to ambiguous signatures, distance uncertainties or heavily obscured IR emission, a significant fraction of sources often remains unclassified. This work demonstrates the capability of machine learning (ML) as a complementary, data-driven approach to automate the identification and classification of these clumps using data from the MALT90 survey, complemented by Spitzer IR photometry. We applied unsupervised clustering with HDBSCAN on molecular line intensities, revealing distinct groupings that correspond to evolutionary stages. Using only five molecular lines (HCO$^+$, HNC, N$_2$H$^+$, HCN, C$_2$H), we identified stable clusters of protostars and regions without active star formation, driven primarily by C$_2$H and N$_2$H$^+$ emission. Incorporating H$^{13}$CO$^+$ gave rise to a distinct UV-dominant cluster, tracing more evolved regions. Infrared properties appeared as non-significant features, implying that envelopes of clumps with different masses are similar in their global infrared characteristics. We then employed supervised learning to classify clumps with previously uncertain categories and provided classifications for 522 objects, predominantly as regions without active star formation. Our results show that ML techniques can effectively uncover intrinsic evolutionary structures in complex astrochemical data and assign categories to uncertain sources, providing a powerful, data-driven complement to traditional methods.
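
The supervised step in the abstract (train on sources with assigned categories, then predict a category for sources marked as uncertain) can be pictured with a toy example. This is a minimal sketch and not the authors' pipeline: it swaps in a plain-Python nearest-centroid rule for the paper's classifier, and the intensity values, the two-feature layout and the helper names `fit_centroids` and `classify_uncertain` are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Return the mean feature vector of each category."""
    groups = defaultdict(list)
    for vector, label in zip(features, labels):
        groups[label].append(vector)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in groups.items()
    }


def classify_uncertain(features, labels, uncertain="U"):
    """Keep known labels; give each uncertain source its nearest centroid's label."""
    known = [(v, l) for v, l in zip(features, labels) if l != uncertain]
    centroids = fit_centroids([v for v, _ in known], [l for _, l in known])
    result = []
    for vector, label in zip(features, labels):
        if label == uncertain:
            # Stand-in rule (assumption): smallest Euclidean distance wins.
            label = min(centroids, key=lambda c: dist(vector, centroids[c]))
        result.append(label)
    return result


# Hypothetical (N2H+, C2H) integrated intensities; these are NOT MALT90 values.
features = [
    (1.0, 0.2), (1.2, 0.3), (0.9, 0.1),   # labelled Q (quiescent)
    (3.0, 2.1), (3.2, 2.4), (2.8, 2.0),   # labelled A (protostellar)
    (1.1, 0.25), (3.1, 2.2),              # labelled U (uncertain)
]
labels = ["Q", "Q", "Q", "A", "A", "A", "U", "U"]

predicted = classify_uncertain(features, labels)
print(predicted)
```

In this toy set the two uncertain sources are assigned to whichever labelled group they sit closest to in intensity space. A real analysis would use the full set of line intensities and a validated classifier.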

Paper Structure (26 sections, 10 figures, 4 tables)


Figures (10)

  • Figure 1: Detection patterns of the five most common molecular lines across the MALT90 catalogue. Each row represents a molecule, and each column represents a specific combination of detected molecules. Grey dots indicate non-detections, while black dots indicate detected lines. Black dots within a column are connected to show which molecules are detected together. The horizontal bars on the left indicate the number of sources in which each molecule is detected, while the vertical bars on top represent the number of sources with each specific combination of molecules ("intersection size").
  • Figure 2: Result of HDBSCAN clustering based on integrated intensity values of five molecules: HCO$^+$, HNC, N$_2$H$^+$, HCN and C$_2$H. Clusters are shown in different colours, with stable clusters labelled according to the most abundant category within each cluster. The axes comp1 and comp2 are the two dimensions of the t-SNE embedding. The legend lists clusters in descending order, and the cluster labelled -1 represents outliers. The main diagonal shows the distributions of the residual dimensions for the clusters, represented as kernel density estimates (KDEs). Peaks roughly indicate modes, and the spread provides a visual sense of variance.
  • Figure 3: The distribution of object types within the two most stable and populated clusters identified with integrated intensity values of five molecules: HCO$^+$, HNC, N$_2$H$^+$, HCN and C$_2$H. The total number of objects in each cluster is shown in the bottom right corner of its panel. Each bar represents a certain object type from the MALT90 catalogue: Q for quiescent, A for protostellar, C for compact regions, H for extended regions, P for photo-dissociation regions and U for uncertain classifications. The percentages indicate the fraction of each object type within a cluster, while the actual number of objects is shown along the x-axis.
  • Figure 4: Random Forest-derived feature importance for the classification model based on integrated intensities of five molecular lines.
  • Figure 5: Result of HDBSCAN clustering based on integrated intensity values of six molecules (HCO$^+$, HNC, N$_2$H$^+$, HCN, C$_2$H and H$^{13}$CO$^+$) together with Spitzer IR emission. The comp1 and comp2 axes do not correspond to those in Figure 2.
  • ...and 5 more figures
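
The two kinds of bars described in the Figure 1 caption (sources per molecule, and sources per exact combination of detected molecules) amount to two simple tallies. The sketch below illustrates that counting on a made-up list of sources. The detection sets are hypothetical, not MALT90 data, and the text output only mimics the dot layout of the figure.

```python
from collections import Counter

MOLECULES = ("HCO+", "HNC", "N2H+", "HCN", "C2H")

# Hypothetical detections: one set of detected molecules per source.
sources = [
    {"HCO+", "HNC", "N2H+", "HCN", "C2H"},
    {"HCO+", "HNC"},
    {"HCO+", "HNC", "N2H+", "HCN", "C2H"},
    {"HCO+"},
    {"HCO+", "HNC"},
    {"HCO+", "HNC", "N2H+"},
]

# Horizontal bars: number of sources in which each molecule is detected.
per_molecule = Counter(m for detected in sources for m in detected)

# Vertical bars: number of sources sharing each exact combination
# of detected molecules (the "intersection size").
intersection_size = Counter(frozenset(detected) for detected in sources)

for molecule in MOLECULES:
    print(f"{molecule:5s} detected in {per_molecule[molecule]} sources")

# One text row per combination: 'x' = detected, '.' = not detected.
for combo, count in intersection_size.most_common():
    dots = " ".join("x" if m in combo else "." for m in MOLECULES)
    print(f"{dots}   {count}")
```

Using `frozenset` as the key means two sources fall in the same column only when they have exactly the same set of detected lines, which is how the caption defines each column.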