Wide Area VISTA Extragalactic Survey (WAVES): Selection of targets for the Wide survey using decision-tree classification

G. Kaur, M. Bilicki, S. Bellstedt, E. Tempel, W. A. Hellwing, I. Baldry, B. Bandi, S. Barsanti, S. Driver, N. Guerra-Varas, B. Holwerda, C. Lagos, J. Loveday, A. Robotham

TL;DR

This work presents a supervised learning approach to WAVES-Wide target selection that bypasses explicit photometric redshift estimation by predicting the probability that a galaxy lies below $z=0.2$ using a binary classifier trained on spectroscopic labels. Utilizing 9-band KiDS+VISTA photometry and 36 color combinations (45 features in total), an XGBoost classifier is trained on a spec-z catalog and evaluated with 5-fold cross-validation, achieving about 95% purity and completeness for the fiducial cuts $Z<21.1$ and $z<0.2$, with the probability threshold $P>0.5$ providing a balanced trade-off. The classifier's performance degrades near selection limits and depends on S/N and color, but SHAP and permutation-importance analyses identify $g-r$, $u-g$, $g$, and $J-K_s$ as highly informative features. When applied to the full photo-catalog, the method yields ~2.6 million targets at $P>0.5$, illustrating the need to tune probability thresholds to control sample size and contamination, and highlighting extrapolation risks due to incomplete spectroscopic coverage. The approach is designed to complement other WAVES target-selection methods in a joint framework, offering a scalable path to constructing a flux- and redshift-limited spectroscopic sample for the 4MOST instrument, with room for improvement as more labeled data become available.
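The feature count quoted above follows from simple combinatorics: 9 magnitudes plus all $\binom{9}{2}=36$ pairwise colors give 45 features. The sketch below illustrates this under stated assumptions. The band list is assumed to be the usual KiDS $ugri$ plus VISTA $ZYJHK_s$ set (the text names only some of these bands explicitly), and the function name `build_features` and the example magnitudes are hypothetical, not taken from the paper.

```python
from itertools import combinations

# Assumed band list: VST/KiDS ugri plus VISTA ZYJHKs (9 bands in total).
BANDS = ["u", "g", "r", "i", "Z", "Y", "J", "H", "Ks"]


def build_features(mags):
    """Return the 9 magnitudes plus all 36 pairwise colors for one galaxy.

    `mags` maps each band name in BANDS to a magnitude.
    Colors are formed as (bluer band) - (redder band), e.g. "g-r".
    """
    features = {band: mags[band] for band in BANDS}
    for blue, red in combinations(BANDS, 2):
        features[f"{blue}-{red}"] = mags[blue] - mags[red]
    return features


# Made-up magnitudes for a single illustrative galaxy.
example = {
    "u": 22.4, "g": 21.3, "r": 20.5, "i": 20.1, "Z": 19.9,
    "Y": 19.7, "J": 19.5, "H": 19.2, "Ks": 19.0,
}
feats = build_features(example)
print(len(feats))  # 9 magnitudes + 36 colors
```

The colors named as most informative in the text ($g-r$, $u-g$, $J-K_s$) appear in this dictionary under the keys `"g-r"`, `"u-g"` and `"J-Ks"`.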

Abstract

The Wide-Area VISTA Extragalactic Survey (WAVES) on the 4-metre Multi-Object Spectroscopic Telescope (4MOST) includes two flux-limited subsurveys with very high (95%) completeness requirements: Wide over $\sim\!1200$ deg$^2$ and Deep over $\sim\!65$ deg$^2$. Both are $Z$-band selected, respectively as $Z<21.1$ and $Z<21.25$ mag, and additionally redshift-limited, while the true redshifts are not known a priori but will only be measured by 4MOST. Here, we present a classification-based method to select the targets for WAVES-Wide. Rather than estimating individual redshifts for the input photometric objects, we assign probabilities of them being below $z=0.2$, the redshift limit of the subsurvey. This is done with the supervised machine learning approach of eXtreme Gradient Boosting (XGB), trained on a comprehensive spectroscopic sample overlapping with WAVES fields. Our feature space is composed of nine VST+VISTA magnitudes from $u$ to $K_s$ and all the possible colors, but most relevant for the classification are the $g$-band and the $u-g$, $g-r$ and $J-K_s$ colors. We check the performance of our classifier both for the fiducial WAVES-Wide limits and for a range of neighboring redshift and magnitude thresholds, consistently finding purity and completeness at the level of 94-95%. We note, however, that this performance deteriorates for sources close to the selection limits, due to deficiencies of the current spectroscopic training sample and the decreasing signal-to-noise of the photometry. We apply the classifier trained on the full spectroscopic sample to 14 million photometric galaxies from the WAVES input catalog, which have all 9 bands measured. Our work demonstrates that a machine-learning classifier could be used to select a flux- and redshift-limited sample from deep photometric data.
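To make the purity and completeness figures concrete, the sketch below evaluates both for a probability cut, assuming the standard definitions: purity is the fraction of selected objects that truly lie below the redshift limit, and completeness is the fraction of true low-redshift objects that are selected. The function name `purity_completeness` and the toy probabilities and labels are hypothetical and are not results from the paper.

```python
def purity_completeness(probs, is_target, threshold=0.5):
    """Purity and completeness of the selection P > threshold.

    probs     : classifier probabilities of lying below the redshift limit
    is_target : True where the spectroscopic redshift is below the limit
    """
    selected = [p > threshold for p in probs]
    tp = sum(1 for s, t in zip(selected, is_target) if s and t)
    fp = sum(1 for s, t in zip(selected, is_target) if s and not t)
    fn = sum(1 for s, t in zip(selected, is_target) if not s and t)
    purity = tp / (tp + fp) if (tp + fp) else float("nan")
    completeness = tp / (tp + fn) if (tp + fn) else float("nan")
    return purity, completeness


# Toy example: six galaxies with made-up probabilities and true labels.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
is_target = [True, True, False, True, False, False]

for cut in (0.35, 0.5, 0.7):
    purity, completeness = purity_completeness(probs, is_target, cut)
    print(f"P > {cut}: purity={purity:.2f}, completeness={completeness:.2f}")
```

In this toy sample, raising the cut removes contaminants at the cost of missing true targets, and lowering it does the reverse, which is the trade-off that the choice of probability threshold controls.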

Paper Structure

This paper contains 17 sections, 12 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Distribution of galaxies from the spectroscopic catalog used to calibrate our classifier, shown on the $u-g$ vs. $g-r$ color plane. Each pixel is colored according to its median redshift. We show only the pixels with a minimum of 10 galaxies each.
  • Figure 2: Dependence of the median redshift (left-hand axis, violet diagonal line) of galaxies selected as flux-limited up to the $Z$-band magnitude indicated on the x-axis, as derived from a Shark/SURFS mock galaxy catalog. The right-hand axis and the green lines show what percentage of galaxies would constitute the WAVES targets for the $Z$-band flux limit as on the x-axis and the fiducial redshift limits: $z<0.2$ for WW (bottom green line) and $z<0.8$ for WD (top green line).
  • Figure 3: Number density of galaxies projected on the $u-g$ vs. $g-r$ color plane, respectively for the spec-z-cat (left) that we use to calibrate our ML model, and for the entire WAVES photo-cat (right), from which WW targets are selected. We show only the pixels with a minimum of 5 galaxies each for the spec-z-cat (left) and a minimum of 100 sources each for the photo-cat (right).
  • Figure 4: Comparison of number counts as a function of the $Z$-band magnitude in the full WAVES photometric catalog (blue line) and the training set with spectroscopic redshift labels (yellow line). The scale on the y-axis gives actual counts per bin for the photo-cat, while for the spec-z-cat, the counts have been artificially rescaled to give a good match at the bright end.
  • Figure 5: Comparison of redshift distributions between our spectroscopic training set (yellow) and the Shark mock catalog (blue), for galaxies at the flux limit of $Z<21.1$ mag. The vertical line indicates the fiducial WW redshift limit for the targets. The redshift distribution of the mock catalog is rescaled to match the counts of the spec-z-cat at low redshift.
  • ...and 12 more figures