QHSC: The Quasar Candidate Catalog for the Hyper Suprime-Cam Subaru Strategic Program

Rui Zhu, Xue-Bing Wu, Yuxuan Pang, Yuming Fu

TL;DR

The paper presents QHSC, a deep, ML-driven catalog of quasar candidates in the HSC-SSP Wide survey, built from four photometric parent samples and evaluated with multiple deep spectroscopic datasets. It employs XGBoost classifiers for quasar selection and a bagging-XGBoost regressor for photometric redshift estimation, achieving high completeness ($>85\%$) and substantial purity, especially when mid-infrared data from WISE are included. Near-infrared data from UKIDSS/VISTA and, to a lesser extent, SCUSS $u$-band data further improve redshift estimates and reduce the fraction of catastrophic outliers to $\sim 10\%$ in optimized samples. The publicly available QHSC catalog supports studies of quasars and cosmology and demonstrates the viability of ensemble ML approaches for quasar selection in upcoming wide and deep imaging surveys.
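To make the completeness and purity figures concrete, the following is a minimal sketch of how both quantities respond to a cut on a classification probability such as the catalog's p_QSO. The function name, the toy labels, and the threshold values are illustrative assumptions and are not taken from the paper.

```python
def completeness_and_purity(is_quasar, p_qso, threshold=0.5):
    """Return (completeness, purity) for sources with p_qso >= threshold.

    is_quasar : list of bool, spectroscopic truth (hypothetical toy labels)
    p_qso     : list of float, classifier probability of being a quasar
    """
    selected = [p >= threshold for p in p_qso]
    true_pos = sum(1 for t, s in zip(is_quasar, selected) if t and s)
    n_true = sum(1 for t in is_quasar if t)
    n_selected = sum(selected)
    # Completeness: fraction of true quasars that are selected.
    completeness = true_pos / n_true if n_true else float("nan")
    # Purity: fraction of selected sources that are true quasars.
    purity = true_pos / n_selected if n_selected else float("nan")
    return completeness, purity


# Toy data, invented for illustration only.
labels = [True, True, True, True, False, False, False, False]
p_qso = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.05]

loose = completeness_and_purity(labels, p_qso, threshold=0.5)
strict = completeness_and_purity(labels, p_qso, threshold=0.75)
print(loose, strict)
```

In this toy case, raising the threshold trades completeness for purity, which is the usual motivation for publishing probabilities rather than only hard class labels.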

Abstract

The Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP) is a deep wide-field multi-band imaging survey consisting of three layers (Wide, Deep, and UltraDeep), with the Wide layer covering $\sim 1470$ deg$^2$ to a depth of $i \sim 26$ mag. We present the QHSC catalog, a machine-learning-selected sample of quasar candidates with photometric redshifts in the Wide layer of the HSC-SSP survey (Public Data Release 3). The full QHSC catalog contains four distinct samples: a master sample with HSC-only photometry, an HSC+WISE sample, and two samples including near-infrared data from UKIDSS and VISTA, denoted as GoldenU and GoldenV. For each sample, an XGBoost classifier is trained and evaluated using independent spectroscopic test sets from HETDEX, VVDS, and zCOSMOS-bright. The numbers of quasar candidates in the QHSC catalog are 1,184,574 (master), 371,777 (HSC+WISE), 87,460 (GoldenU), and 120,572 (GoldenV), with respective completeness values of 85.3%, 92.7%, 89.8%, and 91.3%. We develop ensemble photometric redshift estimators based on bootstrap aggregating (bagging) of multiple XGBoost regressors, achieving outlier fractions of 21.7%, 13.1%, 9.5%, and 10.7% for these samples. The catalog provides quasar classification probabilities ($p_{\mathrm{QSO}}$), enabling construction of purer subsamples via thresholding. This work offers a valuable resource for studies of quasars and cosmology, and highlights the effectiveness of machine learning for quasar selection in future wide and deep imaging surveys. The catalog is publicly available at https://doi.org/10.5281/zenodo.17515028.
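The bagging idea and the outlier-fraction metric mentioned in the abstract can be sketched in a few lines. This is a hedged illustration, not the authors' pipeline. A trivial nearest-neighbour regressor on a single colour stands in for each XGBoost regressor, the ensemble prediction is taken as the mean of the members, and the outlier criterion $|\Delta z|/(1+z_{\mathrm{spec}}) > 0.15$ is a commonly used convention assumed here. None of these choices is confirmed by the text above, and all names and data are hypothetical.

```python
import random
import statistics


def fit_nearest_neighbour(colours, redshifts):
    """Stand-in base learner (NOT XGBoost): return the redshift of the
    training source whose colour is closest to the query colour."""
    pairs = list(zip(colours, redshifts))

    def predict(colour):
        return min(pairs, key=lambda pair: abs(pair[0] - colour))[1]

    return predict


def fit_bagging(colours, redshifts, n_models=25, seed=0):
    """Bootstrap aggregating: fit each base learner on a resample drawn
    with replacement, then average the member predictions."""
    rng = random.Random(seed)
    n = len(colours)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        models.append(
            fit_nearest_neighbour([colours[i] for i in idx],
                                  [redshifts[i] for i in idx])
        )

    def predict(colour):
        # Averaging is an assumption; a median would also be a valid choice.
        return statistics.mean(model(colour) for model in models)

    return predict


def outlier_fraction(z_spec, z_phot, threshold=0.15):
    """Fraction of sources with |z_phot - z_spec| / (1 + z_spec) > threshold."""
    n_out = sum(abs(zp - zs) / (1.0 + zs) > threshold
                for zs, zp in zip(z_spec, z_phot))
    return n_out / len(z_spec)


# Toy training set: redshift is a made-up linear function of one colour.
train_colours = [i / 10 for i in range(31)]
train_z = [2.0 * c for c in train_colours]
photo_z = fit_bagging(train_colours, train_z)
z_hat = photo_z(1.23)

# Outlier fraction on four hand-written (z_spec, z_phot) pairs.
frac = outlier_fraction([0.5, 1.0, 2.0, 3.0], [0.52, 1.4, 2.1, 1.0])
print(z_hat, frac)
```

Because each member sees a different resample, the spread of the member predictions could also serve as a rough per-source uncertainty, although whether the catalog reports such a quantity is not stated here.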

Paper Structure

This paper contains 30 sections, 10 equations, 13 figures.

Figures (13)

  • Figure 1: The full set of filter transmission curves used in this work, normalized to a maximum transmission of one. Blue and red lines represent the HSC and WISE filters, respectively. Cyan lines represent near-infrared filters from UKIDSS (solid) and VISTA (dashed).
  • Figure 2: Histograms of apparent magnitudes for each parent sample in the HSC optical bands ($g$, $r$, $i$, $z$, $y$), UKIDSS near-IR bands ($J$, $H$, $K$), VISTA near-IR bands ($J$, $H$, $K_{\mathrm{s}}$), and WISE mid-IR bands ($W1$, $W2$). The y-axis represents the number density. The size of each parent sample is indicated in the legend.
  • Figure 3: Simplified workflow of the QHSC pipeline. The pipeline starts by cross-matching the HSC data with infrared photometric catalogs (e.g., CatWISE2020, UKIDSS, and VISTA surveys) to build four parent samples. Each sample is used to train a separate XGBoost classifier, which selects quasar candidates; their photo-$z$ values are then estimated using the bagging XGBoost method. The final output of the pipeline consists of the quasar candidate catalogs, along with their photo-$z$ estimates.
  • Figure 4: Distributions of $i$-band magnitude (number density) for the training and testing sets of each parent sample. The first row shows the combined SDSS, DESI, and VLMS samples used for model training and evaluation with a random 9:1 train-test split. The remaining three rows, HETDEX, VVDS, and zCOSMOS, serve as independent test sets.
  • Figure 5: Normalized confusion matrices of XGBoost classifiers for each parent sample, evaluated on the random, HETDEX, VVDS, and zCOSMOS test sets. The color scale reflects the number of sources. Diagonal elements denote recall (completeness) for each class and off-diagonal elements represent misclassification rates.
  • ...and 8 more figures
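
As a reading aid for the normalized confusion matrices described in the Figure 5 caption, here is a small sketch of row normalization, where each row (true class) is divided by its total so that the diagonal gives per-class recall. The class names and labels below are invented for illustration; the caption does not list the classes used in the paper.

```python
def normalized_confusion(true_labels, pred_labels, classes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    sources of true class i that were predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, pred_labels):
        counts[index[truth]][index[pred]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no true sources are left as zeros.
        matrix.append([value / total if total else 0.0 for value in row])
    return matrix


# Hypothetical class names and toy labels.
classes = ["galaxy", "quasar", "star"]
y_true = ["quasar", "quasar", "quasar", "quasar",
          "star", "star", "galaxy", "galaxy"]
y_pred = ["quasar", "quasar", "quasar", "star",
          "star", "star", "galaxy", "quasar"]

matrix = normalized_confusion(y_true, y_pred, classes)
quasar_recall = matrix[classes.index("quasar")][classes.index("quasar")]
print(matrix, quasar_recall)
```

With this normalization each row sums to one, so the quasar diagonal entry is the completeness and the remaining entries in that row are the rates at which true quasars are misclassified into each other class.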