Unsupervised learning for variability detection with Gaia DR3 photometry. The main sequence-white dwarf valley

P. Ranaivomanana, C. Johnston, G. Iorio, P. J. Groot, M. Uzundag, T. Kupfer, C. Aerts

TL;DR

The paper tackles scalable, unsupervised variability detection in Gaia DR3 epoch photometry by applying t-SNE-based dimensionality reduction to a large, unconstrained sample of 13,405 sources in the main-sequence–white-dwarf valley. It extracts 81 light-curve features and uses a Gaussian mixture model to identify 10 clusters corresponding to physical classes such as hot subdwarfs, cataclysmic variables (CVs), and eclipsing binaries, with SHAP highlighting the most discriminative features. The approach reveals substructure, including hints of pulsation modes in hot subdwarfs and sub-clusters within CVs, while also exposing data-quality artefacts and crowded-field effects. RUWE-based cuts demonstrate a trade-off between cluster purity and sample size. Overall, the framework is presented as scalable, interpretable, and capable of guiding the discovery of stellar subtypes in upcoming large time-domain surveys.

Abstract

The unprecedented volume and quality of data from space- and ground-based telescopes present an opportunity for machine learning to identify new classes of variable stars and peculiar systems that may have been overlooked by traditional methods. Extending prior methodological work, this study investigates whether an unsupervised learning approach can scale effectively to larger stellar populations, including objects in crowded fields, without the need for pre-selected catalogues. It focuses on 13 405 sources selected from Gaia DR3 that lie in a selected region of the colour–magnitude diagram (CMD). Our methodology incorporates unsupervised clustering techniques based primarily on statistical features extracted from Gaia DR3 epoch photometry. We used the t-distributed stochastic neighbour embedding (t-SNE) algorithm to identify variability classes, their subtypes, and spurious variability induced by instrumental effects. The clustering results revealed distinct groups, including hot subdwarfs, cataclysmic variables (CVs), eclipsing binaries, and objects in crowded fields, such as those in the Andromeda (M31) field. Several potential stellar subtypes also emerged within these clusters. Notably, objects previously labelled as RR Lyrae were found in an unexpected region of the CMD, potentially due to either unreliable astrometric measurements (e.g., due to binarity) or alternative evolutionary pathways. This study emphasises the robustness of the proposed method in finding variable objects in a large region of the Gaia CMD, including variable hot subdwarfs and CVs, while demonstrating its efficiency in detecting variability in extended stellar populations. The proposed unsupervised learning framework demonstrates scalability to large datasets and yields promising results in identifying stellar subclasses.
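
The pipeline summarised above (a per-source feature table, a t-SNE embedding, and Gaussian-mixture clustering) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The function name `embed_and_cluster`, the standardisation step, the perplexity, the PCA initialisation, and the choice to fit the mixture model on the 2-D embedding rather than on the original features are all assumptions. The input is synthetic noise with 81 columns standing in for the real light-curve features.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def embed_and_cluster(features, n_clusters=10, perplexity=30.0, seed=0):
    """Standardise features, embed them in 2-D with t-SNE, then cluster.

    Assumption: the Gaussian mixture is fitted on the 2-D embedding;
    the source does not state which space the mixture model operates in.
    """
    scaled = StandardScaler().fit_transform(features)
    embedding = TSNE(
        n_components=2,
        perplexity=perplexity,  # assumed value, not taken from the paper
        init="pca",
        random_state=seed,
    ).fit_transform(scaled)
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    labels = gmm.fit(embedding).predict(embedding)
    return embedding, labels


# Synthetic stand-in for the feature table: three well-separated groups,
# each source described by 81 made-up "features".
rng = np.random.default_rng(0)
n_per_group, n_features = 100, 81
centres = rng.normal(scale=5.0, size=(3, n_features))
features = np.vstack(
    [c + rng.normal(size=(n_per_group, n_features)) for c in centres]
)

embedding, labels = embed_and_cluster(features, n_clusters=3)
print(embedding.shape, np.bincount(labels))
```

With real data, `features` would be the table of statistics extracted from the epoch photometry and `n_clusters` would be set to the number of mixture components under study (10 in the summary above). Because t-SNE is stochastic, fixing `random_state` matters if the cluster assignments need to be reproducible.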

Paper Structure

This paper contains 22 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Colour–magnitude diagrams, with grey background points representing all selected Gaia DR3 sources within 1 kpc. Left panel: blue points show the 18,085 initial targets drawn from the grey background sources within the black dash-dotted polygon. The dashed grey polygon marks the region from which the targets in Paper I were selected. Right panel: the identified stellar classes among the 13,405 final targets within the same black dash-dotted polygon, namely hot subdwarfs from Paper I (orange circles), eclipsing binaries from the Gaia classification (blue squares), solar-like rotational modulation stars from the Gaia classification (brown stars), CVs from the Canbay2023 catalogue (green triangles), white dwarfs from the SIMBAD database (purple diamonds), and hot subdwarfs from the Culpan2022 catalogue. The dashed grey polygon indicates the freely selected target region.
  • Figure 3: Number of known objects per cluster without a RUWE cut (left) and with the RUWE $< 1.4$ cut applied (right). The x-axis (Cluster) shows the clusters defined in Fig. \ref{subfig:c} and Fig. \ref{subfig:f}, while the y-axis indicates the object types found in each cluster, as described in Table \ref{tab:obj_definition}.
  • Figure 4: SHapley Additive exPlanations (SHAP) values for the most important features in predicting each cluster: the top panel shows the highest-ranked feature, and the bottom panel shows the second-most important. SHAP values are expressed in log-odds units.
  • Figure 5: Gaia G-band period distribution per cluster.
  • Figure 6: Close-up view of the t-SNE embeddings for the HSD (left panel) and CV (right panel) clusters. Left panel: HSD sub-clusters 0 and 1 correspond to the HSD cluster in Fig. \ref{fig:tsne_51_feat}c. The p-mode hot subdwarfs were identified from Baran2024, while the other modes (g, p+g, g mode + binary) were taken from Krzesinski2022. Right panel: magnetic CVs (mCVs), non-magnetic CVs (non-mCVs), and dwarf novae (DN) from Canbay2023 are shown.
  • ...and 11 more figures
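
The comparison described in the Figure 3 caption, counting known objects per cluster with and without the RUWE $< 1.4$ cut, can be illustrated with a small standard-library sketch. Only the 1.4 threshold comes from the caption. The rows, the cluster label "EB", and the helper `counts_per_cluster` are hypothetical and exist purely to show the bookkeeping.

```python
from collections import Counter

RUWE_MAX = 1.4  # threshold quoted in the Figure 3 caption

# Hypothetical rows: (cluster label, known object type, RUWE).
# All values are invented for illustration; they are not from the paper.
sources = [
    ("HSD", "hot subdwarf", 1.02),
    ("HSD", "hot subdwarf", 1.75),
    ("CV", "CV", 1.10),
    ("CV", "eclipsing binary", 2.30),
    ("EB", "eclipsing binary", 0.98),
    ("EB", "eclipsing binary", 1.39),
]


def counts_per_cluster(rows, ruwe_max=None):
    """Count (cluster, object type) pairs, optionally keeping RUWE < ruwe_max."""
    kept = [r for r in rows if ruwe_max is None or r[2] < ruwe_max]
    return Counter((cluster, obj_type) for cluster, obj_type, _ in kept)


all_counts = counts_per_cluster(sources)
cut_counts = counts_per_cluster(sources, ruwe_max=RUWE_MAX)

for key in sorted(all_counts):
    print(key, all_counts[key], "->", cut_counts[key])
```

In this toy table the cut removes the one eclipsing binary assigned to the CV cluster, which makes that cluster purer, but it also removes a genuine hot subdwarf from the HSD cluster. That is the purity-versus-sample-size trade-off the summary attributes to RUWE-based cuts.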