Machine Learning Classification of COSMOS2020 Galaxies: Quiescent vs. Star-Forming

Vahid Asadi, Nima Chartab, Akram Hasani Zonoozi, Hosein Haghi, Ghassem Gozaliasl, Aryana Haghjoo, Bahram Mobasher

TL;DR

This work tackles the challenge of reliably separating quiescent from star-forming galaxies in large surveys by leveraging machine learning trained on realistic mock photometry from the Santa Cruz semi-analytic model. A CatBoostClassifier is trained on 28 color features derived from eight mutual bands between the SAM mocks and the COSMOS2020 data, achieving a quiescent F1-score of 0.888 and AUC of 0.97, while vastly surpassing traditional SED-fitting in both accuracy (notably recall) and speed. When applied to the COSMOS2020 catalog, the ML approach yields a higher inferred quiescent fraction across 0.2 < z < 3.5 and provides a scalable path for large surveys. The study highlights the practical potential of ML methods to improve galaxy population studies, with publicly available trained models and classifications to enable community use.
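
The count of 28 color features follows from taking every unordered pair of the eight shared bands, since C(8, 2) = 28. The sketch below illustrates that bookkeeping only. It assumes each color is a plain magnitude difference, and the band list is partly hypothetical: `u`, `F814W`, `Y`, and `ch1` are named in this summary, while `band5` to `band8` are placeholders for bands not listed here. It is not the authors' feature pipeline.

```python
from itertools import combinations

# Partly hypothetical band list: u, F814W, Y and ch1 appear in the text;
# band5..band8 are placeholders because the full list is not given here.
BANDS = ["u", "F814W", "Y", "ch1", "band5", "band6", "band7", "band8"]


def color_features(mags):
    """Return all pairwise colors (magnitude differences) for one galaxy.

    `mags` maps each band name in BANDS to an AB magnitude.
    """
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(BANDS, 2)}


# Made-up magnitudes, used only to exercise the function.
example_mags = {band: 24.0 + 0.1 * i for i, band in enumerate(BANDS)}
colors = color_features(example_mags)
print(len(colors))  # 28
```

Using all pairs rather than only adjacent-band colors is redundant in a linear sense, but a tree-based classifier can split on any single color directly, which may be why the full set is used.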

Abstract

Accurately distinguishing between quiescent and star-forming galaxies is essential for understanding galaxy evolution. Traditional methods, such as spectral energy distribution (SED) fitting, can be computationally expensive and may struggle to capture complex galaxy properties. This study aims to develop a robust and efficient machine learning (ML) classification method to identify quiescent and star-forming galaxies within the Farmer COSMOS2020 catalog. We used JWST wide-field light cones from the Santa Cruz semi-analytic modeling framework to train a supervised ML model, the CatBoostClassifier, on 28 color features derived from the 8 photometric bands shared by the mock and COSMOS2020 catalogs. The model was validated against a testing set and compared to the SED-fitting method in terms of precision, recall, F1-score, and execution time. Preprocessing steps included addressing missing data, injecting observational noise, and applying a magnitude cut (ch1 < 26 AB) along with a redshift range of 0.2 < z < 3.5 to align the simulated and observational datasets. The ML method achieved an F1-score of 89% for quiescent galaxies, significantly outperforming the SED-fitting method, which achieved 54%. The ML model demonstrated superior recall (88% vs. 38%) while maintaining comparable precision. When applied to the COSMOS2020 catalog, the ML model predicted a systematically higher fraction of quiescent galaxies across all redshift bins within 0.2 < z < 3.5 compared to traditional methods like NUVrJ and SED-fitting. This study shows that ML, combined with multi-wavelength data, can effectively identify quiescent and star-forming galaxies, providing valuable insights into galaxy evolution. The trained classifier and full classification catalog are publicly available.
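
The abstract quotes F1-score and recall for both methods but gives no precision values. Because F1 is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R), the precision implied by the quoted figures can be back-solved as P = F1 · R / (2R − F1). The sketch below does that arithmetic. The function names are illustrative, and the inputs are the rounded percentages from the abstract, so the outputs are only approximate and are not values reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P; requires 2 * recall > f1."""
    return f1 * recall / (2 * recall - f1)


# Rounded quiescent-class figures quoted in the abstract.
ml_precision = implied_precision(f1=0.89, recall=0.88)
sed_precision = implied_precision(f1=0.54, recall=0.38)

print(f"ML implied precision:  {ml_precision:.2f}")
print(f"SED implied precision: {sed_precision:.2f}")
```

Under these rounded inputs both implied precisions come out near 0.9, which is consistent with the statement that precision is comparable and that the gap in F1-score is driven mainly by recall.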

Paper Structure

This paper contains 22 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview of the data analysis pipeline used in this study.
  • Figure 2: Distribution of total galaxies and the quiescent galaxy fraction as a function of redshift bins in the mock sample for $M_{QG} \geq 10^{9.5} M_{\odot}$. The plot highlights the significant decline in the fraction of quiescent galaxies at $z > 3.5$.
  • Figure 3: Redshift distributions of training and testing sets.
  • Figure 4: Distribution of training and testing sets in the mock sample, visualized using UMAP. Red and black points
  • Figure 5: Heatmap comparison of the mock (with and without injected noise) and COSMOS2020 galaxy colors (u-F814W and F814W-Y). The noise-injected mock galaxies show similar color distributions and cover the COSMOS2020 galaxy manifold, indicating that the mocks represent the observed colors well.
  • ...and 6 more figures